Question: implement the least-square method to determine the linear function that best fits the data. This method also needs to find the coefficient of determination (R^2) and standard error of estimation (E). Input to this method is a collection of data points (x, y) and the collection's size, a.k.a. number of data points.
Solution: the answer is straight forward. We basically just have to apply the statistics formulas for finding the least-square linear function to the data. If you are not familiar with the formulas and where they come from here is the link for you. Now, let's take a look at the implementation below:
#include<iostream> #include<cmath> using namespace std; struct Point { double x; double y; }; void leastSqrRegression(struct Point* xyCollection, int dataSize) { if (xyCollection == NULL || dataSize == 0) { printf("Empty data set!\n"); return; } double SUMx = 0; //sum of x values double SUMy = 0; //sum of y values double SUMxy = 0; //sum of x * y double SUMxx = 0; //sum of x^2 double SUMres = 0; //sum of squared residue double res = 0; //residue squared double slope = 0; //slope of regression line double y_intercept = 0; //y intercept of regression line double SUM_Yres = 0; //sum of squared of the discrepancies double AVGy = 0; //mean of y double AVGx = 0; //mean of x double Yres = 0; //squared of the discrepancies double Rsqr = 0; //coefficient of determination //calculate various sums for (int i = 0; i < dataSize; i++) { //sum of x SUMx = SUMx + (xyCollection + i)->x; //sum of y SUMy = SUMy + (xyCollection + i)->y; //sum of squared x*y SUMxy = SUMxy + (xyCollection + i)->x * (xyCollection + i)->y; //sum of squared x SUMxx = SUMxx + (xyCollection + i)->x * (xyCollection + i)->x; } //calculate the means of x and y AVGy = SUMy / dataSize; AVGx = SUMx / dataSize; //slope or a1 slope = (dataSize * SUMxy - SUMx * SUMy) / (dataSize * SUMxx - SUMx*SUMx); //y itercept or a0 y_intercept = AVGy - slope * AVGx; printf("x mean(AVGx) = %0.5E\n", AVGx); printf("y mean(AVGy) = %0.5E\n", AVGy); printf ("\n"); printf ("The linear equation that best fits the given data:\n"); printf (" y = %2.8lfx + %2.8f\n", slope, y_intercept); printf ("------------------------------------------------------------\n"); printf (" Original (x,y) (y_i - y_avg)^2 (y_i - a_o - a_1*x_i)^2\n"); printf ("------------------------------------------------------------\n"); //calculate squared residues, their sum etc. for (int i = 0; i < dataSize; i++) { //current (y_i - a0 - a1 * x_i)^2 Yres = pow(((xyCollection + i)->y - y_intercept - (slope * (xyCollection + i)->x)), 2); //sum of (y_i - a0 - a1 * x_i)^2 SUM_Yres += Yres; //current residue squared (y_i - AVGy)^2 res = pow((xyCollection + i)->y - AVGy, 2); //sum of squared residues SUMres += res; printf (" (%0.2f %0.2f) %0.5E %0.5E\n", (xyCollection + i)->x, (xyCollection + i)->y, res, Yres); } //calculate r^2 coefficient of determination Rsqr = (SUMres - SUM_Yres) / SUMres; printf("--------------------------------------------------\n"); printf("Sum of (y_i - y_avg)^2 = %0.5E\t\n", SUMres); printf("Sum of (y_i - a_o - a_1*x_i)^2 = %0.5E\t\n", SUM_Yres); printf("Standard deviation(St) = %0.5E\n", sqrt(SUMres / (dataSize - 1))); printf("Standard error of the estimate(Sr) = %0.5E\t\n", sqrt(SUM_Yres / (dataSize-2))); printf("Coefficent of determination(r^2) = %0.5E\t\n", (SUMres - SUM_Yres)/SUMres); printf("Correlation coefficient(r) = %0.5E\t\n", sqrt(Rsqr)); }
Explanation: assuming that each data point is structured similar to our Point struct, then our method takes an array of Points and the array's size as its parameters.
To advance any further, we must first find these sums:
- SUMx: the sum of all x values of the points
- SUMy: the sum of all y values of the points
- SUMxx: the sum of the square of x values, meaning that we square x values individually and add them togethers
- SUMxy: for each point we multiply its x value with its y value, then add all the results together to find this sum.
The first for loop is used to calculate those sums simultaneously.
After that, we calculate the means of x and y values which equal sum of x values divided by the number of points and sum of y values divided by the number of points respectively.
slope and y_intercept of the best-fit function can then be determined using the means and the sums we found. Thus, the linear function is y = slope*x + y_intercept.
Finally, to calculate the coefficient of determination and standard error of estimation, we need to find the sum of squared standard deviation (SUM_Yres) of each point from the best-fit linear function and sum of the squared residues (SUMres). That's what the second for loop does.
Once, we know sum of squared residues and sum of squared standard deviation, we just apply formulas to find the coefficient and the standard error.
As you can see, the challenge is not writing the code to compute the least-square regression but being able to understand the logic behind that. Why slope, y_intercept, coefficient of determination and standard error of estimation are calculated that way? Where do the formulas come from? Etc..
Answering those questions is beyond the scope of this post, so I leave it to you. I actually took statistics in order to understand the concepts and translate the formulas into code :)
As always, any comment or suggestion is welcome! Bye for now.
This comment has been removed by the author.
ReplyDeleteThank you very much for this piece of code.
ReplyDeleteCould you please give an example of how you create the input struct?
Thx!
Tnx, helped a lot
ReplyDeleteGreat help!!!! I could reproduce this code in C for Microchip microcontroller.
ReplyDeleteThis code has some problems BTW. It fails to give the correct slope when the data on the abscissa becomes large. Around 1e9 will suffice. While this is probably an unusual occurrence, I had the misfortune of running into this bug on a project that I took over. The previous person unfortunately ran across this abortion of code and actually used it. And when it failed randomly, they never bothered to find out why. A cursory examination of Numerical Recipes gives a better algorithm that subtracts the mean of the abscissa data when predicting the slope. All of the complexity of the C++ syntax and the freaking Point data structure hides the simplicity of the LLS algorithm.
ReplyDeleteN=Number of Points:
sx=SUM(x)
sy=SUM(y)
SxOSS=sx/N
Z=x-SxOSS
sx2=SUM(Z**2)
SLOPE=SUM(Z*y)/sx2
INTERCEPT=(sy-sx*SLOPE)/N
Doesnt work
ReplyDelete