Thursday, June 2, 2011

Least-Square Linear Regression of Data Using C++

Question: implement the least-square method to determine the linear function that best fits the data. This method also needs to find the coefficient of determination (R^2) and standard error of estimation (E). Input to this method is a collection of data points (x, y) and the collection's size, a.k.a. number of data points.

Solution: the answer is straight forward. We basically just have to apply the statistics formulas for finding the least-square linear function to the data. If you are not familiar with the formulas and where they come from here is the link for you. Now, let's take a look at the implementation below:

#include<iostream>
#include<cmath>
using namespace std;

struct Point
{
   double x;
   double y;
};

void leastSqrRegression(struct Point* xyCollection, int dataSize)
{
   if (xyCollection == NULL || dataSize == 0)
   {
      printf("Empty data set!\n");
      return;
   }

   double SUMx = 0;     //sum of x values
   double SUMy = 0;     //sum of y values
   double SUMxy = 0;    //sum of x * y
   double SUMxx = 0;    //sum of x^2
   double SUMres = 0;   //sum of squared residue
   double res = 0;      //residue squared
   double slope = 0;    //slope of regression line
   double y_intercept = 0; //y intercept of regression line
   double SUM_Yres = 0; //sum of squared of the discrepancies
   double AVGy = 0;     //mean of y
   double AVGx = 0;     //mean of x
   double Yres = 0;     //squared of the discrepancies
   double Rsqr = 0;     //coefficient of determination

   //calculate various sums 
   for (int i = 0; i < dataSize; i++)
   {
      //sum of x
      SUMx = SUMx + (xyCollection + i)->x;
      //sum of y
      SUMy = SUMy + (xyCollection + i)->y;
      //sum of squared x*y
      SUMxy = SUMxy + (xyCollection + i)->x * (xyCollection + i)->y;
      //sum of squared x
      SUMxx = SUMxx + (xyCollection + i)->x * (xyCollection + i)->x;
   }

   //calculate the means of x and y
   AVGy = SUMy / dataSize;
   AVGx = SUMx / dataSize;

   //slope or a1
   slope = (dataSize * SUMxy - SUMx * SUMy) / (dataSize * SUMxx - SUMx*SUMx);

   //y itercept or a0
   y_intercept = AVGy - slope * AVGx;
   
   printf("x mean(AVGx) = %0.5E\n", AVGx);
   printf("y mean(AVGy) = %0.5E\n", AVGy);

   printf ("\n");
   printf ("The linear equation that best fits the given data:\n");
   printf ("       y = %2.8lfx + %2.8f\n", slope, y_intercept);
   printf ("------------------------------------------------------------\n");
   printf ("   Original (x,y)   (y_i - y_avg)^2     (y_i - a_o - a_1*x_i)^2\n");
   printf ("------------------------------------------------------------\n");

   //calculate squared residues, their sum etc.
   for (int i = 0; i < dataSize; i++) 
   {
      //current (y_i - a0 - a1 * x_i)^2
      Yres = pow(((xyCollection + i)->y - y_intercept - (slope * (xyCollection + i)->x)), 2);

      //sum of (y_i - a0 - a1 * x_i)^2
      SUM_Yres += Yres;

      //current residue squared (y_i - AVGy)^2
      res = pow((xyCollection + i)->y - AVGy, 2);

      //sum of squared residues
      SUMres += res;
      
      printf ("   (%0.2f %0.2f)      %0.5E         %0.5E\n", 
       (xyCollection + i)->x, (xyCollection + i)->y, res, Yres);
   }

   //calculate r^2 coefficient of determination
   Rsqr = (SUMres - SUM_Yres) / SUMres;
   
   printf("--------------------------------------------------\n");
   printf("Sum of (y_i - y_avg)^2 = %0.5E\t\n", SUMres);
   printf("Sum of (y_i - a_o - a_1*x_i)^2 = %0.5E\t\n", SUM_Yres);
   printf("Standard deviation(St) = %0.5E\n", sqrt(SUMres / (dataSize - 1)));
   printf("Standard error of the estimate(Sr) = %0.5E\t\n", sqrt(SUM_Yres / (dataSize-2)));
   printf("Coefficent of determination(r^2) = %0.5E\t\n", (SUMres - SUM_Yres)/SUMres);
   printf("Correlation coefficient(r) = %0.5E\t\n", sqrt(Rsqr));

}

Explanation: assuming that each data point is structured similar to our Point struct, then our method takes an array of Points and the array's size as its parameters.

To advance any further, we must first find these sums:

  • SUMx: the sum of all x values of the points
  • SUMy: the sum of all y values of the points
  • SUMxx: the sum of the square of x values, meaning that we square x values individually and add them togethers
  • SUMxy: for each point we multiply its x value with its y value, then add all the results together to find this sum.

The first for loop is used to calculate those sums simultaneously.

After that, we calculate the means of x and y values which equal sum of x values divided by the number of points and sum of y values divided by the number of points respectively.

slope and y_intercept of the best-fit function can then be determined using the means and the sums we found. Thus, the linear function is y = slope*x + y_intercept.

Finally, to calculate the coefficient of determination and standard error of estimation, we need to find the sum of squared standard deviation (SUM_Yres) of each point from the best-fit linear function and sum of the squared residues (SUMres). That's what the second for loop does.

Once, we know sum of squared residues and sum of squared standard deviation, we just apply formulas to find the coefficient and the standard error.

As you can see, the challenge is not writing the code to compute the least-square regression but being able to understand the logic behind that. Why slope, y_intercept, coefficient of determination and standard error of estimation are calculated that way? Where do the formulas come from? Etc..

Answering those questions is beyond the scope of this post, so I leave it to you. I actually took statistics in order to understand the concepts and translate the formulas into code :)

As always, any comment or suggestion is welcome! Bye for now.

7 comments:

  1. This comment has been removed by the author.

    ReplyDelete
  2. Thank you very much for this piece of code.
    Could you please give an example of how you create the input struct?

    Thx!

    ReplyDelete
  3. Great help!!!! I could reproduce this code in C for Microchip microcontroller.

    ReplyDelete
  4. This code has some problems BTW. It fails to give the correct slope when the data on the abscissa becomes large. Around 1e9 will suffice. While this is probably an unusual occurrence, I had the misfortune of running into this bug on a project that I took over. The previous person unfortunately ran across this abortion of code and actually used it. And when it failed randomly, they never bothered to find out why. A cursory examination of Numerical Recipes gives a better algorithm that subtracts the mean of the abscissa data when predicting the slope. All of the complexity of the C++ syntax and the freaking Point data structure hides the simplicity of the LLS algorithm.

    N=Number of Points:

    sx=SUM(x)
    sy=SUM(y)

    SxOSS=sx/N
    Z=x-SxOSS
    sx2=SUM(Z**2)

    SLOPE=SUM(Z*y)/sx2
    INTERCEPT=(sy-sx*SLOPE)/N

    ReplyDelete
  5. Great and I have a dandy proposal: What Do House Renovations Cost house renovation business

    ReplyDelete