More About Linear Regression

Jun 29

How can you find the vector whose dot product with the data estimates the linear regression slope?

I use the normal equations. The name refers to the normality (perpendicularity) between the error vector and the predictor vectors in X. But why is the error vector orthogonal to all the vectors in X? How does that work?

When we train a linear regression, we give it a vector of target values Y and a matrix X of predictor columns. Each element of Y and each row of X represents one training case. The error vector likewise has one entry per training case: the difference between that case's Y and its prediction MX + B.

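The problem of fitting the model is one of minimizing error magnitude. A mathematically convenient definition of error magnitude is the sum of squared errors. This choice is justified by the likelihood framework: if the errors are assumed to be independent draws from a Normal distribution with a common variance, then minimizing the sum of squared errors is the same as maximizing the likelihood.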

So what does error minimization have to do with orthogonality? Imagine the 3D case. You have three data points (x, y). The training matrix X has two columns: a constant (ones) and the values of x in the three cases. When Y is written as the linear combination of the constant vector (1, 1, 1) and the variable vector (x1, x2, x3) that minimizes the Euclidean norm of the error vector, the error vector must be perpendicular to both (1, 1, 1) and (x1, x2, x3). This is because the two vectors in X span a plane. If Y is in that plane, then the error is zero. If Y is not in that plane, the minimum-error representation of Y as a combination of the columns of X is the orthogonal projection of Y onto that plane. The vector from the projection to Y is the error, and it is orthogonal to the plane of X. In the general case, X has many columns, and the error is orthogonal to all of them. Another way to say this is that the error vector lies in the null space of X transpose (the left null space of X).
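The three-point picture can be checked numerically. The sketch below uses made-up data values and the textbook single-predictor formulas (slope = covariance of x and y divided by variance of x), which are an assumption brought in for illustration, not something derived above. It is written in plain Python so it runs without any libraries.

```python
# Three hypothetical data points (x, y); the values are invented for illustration.
xs = [1.0, 2.0, 4.0]
ys = [2.0, 3.0, 7.0]


def dot(u, v):
    """Dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


n = len(xs)
x_mean = sum(xs) / n
y_mean = sum(ys) / n

# Textbook least-squares fit for a single predictor.
slope = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
         / sum((x - x_mean) ** 2 for x in xs))
intercept = y_mean - slope * x_mean

# The two columns of X, and the error vector Y - (intercept + slope * x).
ones = [1.0] * n
errors = [y - (intercept + slope * x) for x, y in zip(xs, ys)]

# Orthogonality: both dot products should be zero up to rounding.
dot_with_ones = dot(errors, ones)
dot_with_x = dot(errors, xs)
print(dot_with_ones, dot_with_x)
```

If you nudge the slope or intercept away from the fitted values, the two dot products stop being zero and the sum of squared errors goes up, which is the projection argument in numbers.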

We’re almost there. Orthogonality of the error to every column of X can be written as a single equation, X^T (Y - XB) = 0, where B now stands for the whole vector of parameters (intercept and slopes), not just the intercept. This is the normal equation. Solving it gives an expression for the best possible parameters as a function of X and Y, provided X^T X is invertible. The equation, which you are forbidden to remember, is B = (X^T X)^-1 X^T Y. Beta hat equals X transpose X inverse times X transpose Y. In the case of ordinary single variable regression, X contains the constant vector and the variable vector (two columns). The result is that B contains two values, the intercept and the slope. So the vector we wanted is the second row of the matrix (X^T X)^-1 X^T: its dot product with Y is the slope.
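For the two-column case, the rows of (X^T X)^-1 X^T can be written out by hand, because X^T X is only a 2x2 matrix. The following is a minimal sketch under that assumption, again with invented data values and plain Python; the names `slope_weights` and `intercept_weights` are hypothetical labels for the two rows.

```python
# Same hypothetical data as before; the values are invented for illustration.
xs = [1.0, 2.0, 4.0]
ys = [2.0, 3.0, 7.0]


def dot(u, v):
    """Dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


n = len(xs)
sum_x = sum(xs)
sum_xx = dot(xs, xs)

# X has columns (1, 1, 1) and (x1, x2, x3), so
#   X^T X = [[n,     sum_x ],
#            [sum_x, sum_xx]]
# and its inverse is (1 / det) * [[sum_xx, -sum_x], [-sum_x, n]].
det = n * sum_xx - sum_x ** 2  # zero only if all x values are equal

# Rows of (X^T X)^-1 X^T: entry j combines the j-th entries of the two columns.
intercept_weights = [(sum_xx - sum_x * x) / det for x in xs]
slope_weights = [(n * x - sum_x) / det for x in xs]

# Dotting each row with Y gives the corresponding parameter.
intercept = dot(intercept_weights, ys)
slope = dot(slope_weights, ys)
print(slope_weights, slope, intercept)
```

Two properties of `slope_weights` follow from (X^T X)^-1 X^T X being the identity: the weights sum to zero, and their dot product with the x values is one. Note that the weights depend only on x, so the same vector gives the slope for any Y measured at those x values.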

Dan Snyder
