Extras
Workshop 1 - A Linear Algebra Method for Solving the Least Squares Problem
Some of you may have asked: why use SGD when other methods for solving least squares exist, such as simply computing the pseudoinverse? This is a very good question, as the least squares problem can indeed be solved directly through the pseudoinverse. In the section below, I'll explain
- How you can use the pseudoinverse to find the weights of a linear least squares problem
- Why this method can, in many cases, be worse than using SGD to find the weights.
Bridging the Gap: Linear Algebra and Least Squares
Let's get into the math of it.
Say we wanted to find the line between two points in a two-dimensional space, i.e. we only want to fit a line to two points in our dataset. This would be easy: simply solve the system of linear equations to find a line that intersects these two points. In other words, we can just row reduce the matrix below.
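For a line with slope m and intercept b, and two points whose inputs and targets I'll call $a_1, a_2$ and $y_1, y_2$ (these names are just for illustration), that system is

$$\underbrace{\begin{bmatrix} a_1 & 1 \\ a_2 & 1 \end{bmatrix}}_{A} \underbrace{\begin{bmatrix} m \\ b \end{bmatrix}}_{x} = \underbrace{\begin{bmatrix} y_1 \\ y_2 \end{bmatrix}}_{y},$$

i.e. we row reduce the augmented matrix $[A \mid y]$.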
Or, solving for x, the parameters of the line that predicts y, we get the following:
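Since A is square here, and invertible as long as the two inputs differ,

$$x = A^{-1} y.$$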
Therefore, we can get a line such that Ax = y holds exactly, i.e. the line passes through both points.
What if we have so many points that we aren't able to make a line that intersects all of them, for example with the dataset we've shown above? Row reducing this matrix gets us nowhere: the resulting augmented matrix contains an inconsistent row (all zeros on the left with a nonzero entry on the right), meaning there is no exact solution, and A is no longer square, so it is not invertible (keep in mind that a matrix needs to be invertible in order to solve for x this way).
This type of matrix is known as a tall, or overdetermined, matrix. However, even though the matrix isn't invertible and there is no exact solution, that doesn't mean there isn't an approximate solution that we can solve for.
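To make this concrete, with three non-collinear points (the values are just for illustration) the system becomes tall:

$$\begin{bmatrix} 0 & 1 \\ 1 & 1 \\ 2 & 1 \end{bmatrix} \begin{bmatrix} m \\ b \end{bmatrix} = \begin{bmatrix} 1 \\ 2 \\ 4 \end{bmatrix}.$$

The first two equations force m = 1 and b = 1, which misses the third (it would require 2m + b = 4), so no exact solution exists and we can only ask for the best approximation.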
Pseudoinverse: Solving Overdetermined Systems
One way to solve this is to find the pseudoinverse of the matrix, which gives us the closest solution. The pseudoinverse comes from the following derivation, starting with the system we want to solve:
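$$A x = y$$

Here A is tall, so in general no x satisfies this exactly; we are after the x whose predictions $\hat{y} = A x$ come closest to y.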
We want to invert the right side in order to isolate x. If A were square, we could simply multiply both sides by the inverse of the original matrix to find our weights, x. However, we can make the matrix square through a simple algebraic transformation: multiply both sides by the transpose, A^T, such that
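$$A^T A x = A^T y$$

(these are often called the normal equations, and $A^T A$ is square).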
We can then multiply both sides by the inverse of the new matrix on the left, (A^T A)^{-1} (which exists as long as the columns of A are linearly independent), such that
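$$x = (A^T A)^{-1} A^T y$$

The matrix $(A^T A)^{-1} A^T$ is exactly the pseudoinverse of a tall matrix A whose columns are linearly independent.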
This gives us the weights of the line, x, which can then be used to predict values.
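To make this concrete, here is a minimal NumPy sketch of the closed-form solution (the toy data and variable names are just for illustration); in practice `np.linalg.pinv` or `np.linalg.lstsq` compute the same thing more stably:

```python
import numpy as np

# Toy data (illustrative values only): a few points that don't lie on one line.
inputs = np.array([0.0, 1.0, 2.0, 3.0])
targets = np.array([1.1, 1.9, 3.2, 3.8])

# Tall input matrix A: one row per sample, columns [input, 1],
# so that A @ [m, b] approximates the targets y.
A = np.column_stack([inputs, np.ones_like(inputs)])

# Closed-form least squares: x = (A^T A)^{-1} A^T y
x = np.linalg.inv(A.T @ A) @ A.T @ targets

# The same result via NumPy's built-ins (numerically more stable):
x_pinv = np.linalg.pinv(A) @ targets
x_lstsq, *_ = np.linalg.lstsq(A, targets, rcond=None)

print(x)       # [m, b] from the normal equations
print(x_pinv)  # should match, up to floating point error
```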
Summary
Just to give a bit of explanation for the variables we used above:
- \hat{y} : The predicted value that approximates y for an input sample
- y : The true value for each sample that we are trying to predict
- A : The input matrix, i.e. the features as the columns, with each row being a single sample of your data
- x : The weights of the final approximate least squares equation