Videos for Section 6.7


This section features three videos on least squares. In the first video, we explore least squares for matrix problems, and see that it boils down to the problem of projections that we saw in the last section. In the second video, we tackle the most important application of least squares, namely linear regression. Setting up the equations reduces this to a matrix least squares problem. In the third video, we show how to do multiple regression and other sorts of curve fitting, such as fitting higher-order polynomials, exponentials, or power laws to data.

There is an exact solution to the matrix problem $$A {\bf x} = {\bf b}$$ if and only if ${\bf b}$ is in the column space of $A$. If ${\bf b}$ isn't in the column space, we can still ask for the value of ${\bf x}$ that brings $A {\bf x}$ as close as possible to ${\bf b}$. This value of ${\bf x}$ is called a least-squares solution to $A{\bf x} = {\bf b}$.
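
For concreteness, here is a small example of our own (it is not one of the examples from the videos). Take $$A = \begin{pmatrix} 1 & 0 \cr 0 & 1 \cr 1 & 1 \end{pmatrix}, \qquad {\bf b} = \begin{pmatrix} 2 \cr 2 \cr 2 \end{pmatrix}.$$ The first two equations force $x_1 = x_2 = 2$, but then the third equation would require $x_1 + x_2 = 2$, so there is no exact solution: ${\bf b}$ is not in the column space of $A$. We will find the least-squares solution to this system below.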

Key Theorem: Every least-squares solution to $A {\bf x} = {\bf b}$ is an exact solution to $$A^T A {\bf x} = A^T {\bf b}.$$ Likewise, every exact solution to $A^T A {\bf x} = A^T {\bf b}$ is a least-squares solution to $A{\bf x} = {\bf b}$.
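
For the small example above, $$A^T A = \begin{pmatrix} 2 & 1 \cr 1 & 2 \end{pmatrix}, \qquad A^T {\bf b} = \begin{pmatrix} 4 \cr 4 \end{pmatrix},$$ and solving $A^T A {\bf x} = A^T {\bf b}$ gives the least-squares solution ${\bf x} = (4/3, 4/3)$. Here is a minimal numerical check in Python with NumPy (the choice of tool is ours, not the videos'); the normal-equations solution should agree with NumPy's built-in least-squares routine.

```python
import numpy as np

# The small inconsistent system from the example above.
A = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
b = np.array([2.0, 2.0, 2.0])

# Solve the normal equations A^T A x = A^T b directly.
x_normal = np.linalg.solve(A.T @ A, A.T @ b)

# NumPy's least-squares solver should give the same answer.
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)

print(x_normal)  # [1.3333... 1.3333...], i.e. (4/3, 4/3)
print(x_lstsq)   # same values
```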

Given a bunch of data points $(x_1,y_1), \ldots, (x_N,y_N)$, we want to find a "best fit" line $y = c_0 + c_1 x$. This is a least-squares solution to the equations \begin{eqnarray*} c_0 + c_1 x_1 &=& y_1 \cr c_0 + c_1 x_2 &=& y_2 \cr &\vdots & \cr c_0 + c_1 x_N &=& y_N \end{eqnarray*} In other words, $$A = \begin{pmatrix} 1 & x_1 \cr \vdots & \vdots \cr 1 & x_N \end{pmatrix},$$ so $$A^T A = \begin{pmatrix} N & \sum x_i \cr \sum x_i & \sum x_i^2 \end{pmatrix}; \qquad A^T {\bf y} = \begin{pmatrix} \sum y_i \cr \sum x_iy_i \end{pmatrix}.$$ The solution to the $2 \times 2$ system of equations $A^T A {\bf c} = A^T {\bf y}$ can be written in closed form: \begin{eqnarray*} c_0 &=& \frac{(\sum x_i^2)(\sum y_i) - (\sum x_i)(\sum x_iy_i)}{N \sum x_i^2 - (\sum x_i)^2} = \frac{\hbox{Avg}(x^2)\hbox{Avg}(y) - \hbox{Avg}(x)\hbox{Avg}(xy)}{\hbox{Avg}(x^2)- (\hbox{Avg}(x))^2}\cr c_1 &=& \frac{N(\sum x_iy_i) - (\sum x_i)(\sum y_i)}{N \sum x_i^2 - (\sum x_i)^2} = \frac{\hbox{Avg}(xy)-\hbox{Avg}(x)\hbox{Avg}(y)}{\hbox{Avg}(x^2) - (\hbox{Avg}(x))^2},\cr \end{eqnarray*} where "Avg" means the average value over the sample of $N$ points. It's often easier to think in terms of averages than sums. [Note: In probability and statistics, the average of a quantity $x$ is often denoted $E(x)$ or $\langle x \rangle$ or $\bar x$, but we're already using bars for complex conjugates and angle brackets for inner products, so we'll stick with "Avg".]
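
The formulas above are easy to put into code. Here is a short sketch in Python with NumPy (the data points are made up for illustration); it computes $c_0$ and $c_1$ from the averages and cross-checks them against NumPy's polynomial fitter.

```python
import numpy as np

# Made-up sample data, for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Averages needed for the closed-form formulas.
avg_x, avg_y = x.mean(), y.mean()
avg_xx, avg_xy = (x * x).mean(), (x * y).mean()

# Best-fit line y = c0 + c1 x.
c1 = (avg_xy - avg_x * avg_y) / (avg_xx - avg_x**2)
c0 = (avg_xx * avg_y - avg_x * avg_xy) / (avg_xx - avg_x**2)

# Cross-check: np.polyfit returns [slope, intercept] for degree 1.
c1_np, c0_np = np.polyfit(x, y, 1)
print(c0, c1)        # from the averages formulas
print(c0_np, c1_np)  # should match
```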

The quantity $\hbox{Var}(x)=\hbox{Avg}(x^2) - (\hbox{Avg}(x))^2$ is called the variance of $x$ and comes up a lot in probability and statistics. The quantity $\hbox{Cov}(x,y)= \hbox{Avg}(xy)-\hbox{Avg}(x)\hbox{Avg}(y)$ is called the covariance of $x$ and $y$. The dimensionless quantity $$r^2 = \frac{(\hbox{Cov}(x,y))^2}{\hbox{Var}(x) \hbox{Var}(y)}$$ measures how good a fit our best line is: it gives the fraction of the variation in $y$ that is "explained" by $x$. If you hear about correlations with $r$ values of $0.2$ or $0.3$ or $-0.2$ or $-0.3$, they don't mean much, since the line explains only 4 to 9 percent of the variation; correlations with $r = 0.7$ or $0.8$ or $-0.7$ or $-0.8$, which explain roughly half or more of the variation, are much more meaningful.
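
In code, these quantities take one line each. Here is a sketch continuing with the made-up data from the previous block, cross-checking $r$ against NumPy's `np.corrcoef`.

```python
# Variance, covariance, and r^2 from the averages (x and y as above).
var_x = (x * x).mean() - x.mean()**2
var_y = (y * y).mean() - y.mean()**2
cov_xy = (x * y).mean() - x.mean() * y.mean()

r_squared = cov_xy**2 / (var_x * var_y)

# np.corrcoef gives the correlation r; its square should agree.
r = np.corrcoef(x, y)[0, 1]
print(r_squared, r**2)
```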

The same least-squares machinery handles more general models; here are a few examples.

  1. Multiple regression: If $z$ is a function of two variables $x$ and $y$, then we look for a model of the form $z = a + b x + c y$. Each data point gives us an equation $a + b x_i + c y_i = z_i$ in the variables $(a,b,c)$. We wind up with $N$ equations in 3 unknowns, and we can find a least-squares solution. Similar ideas work if our output is a function of 3 or more input variables.
  2. Quadratic fit: To fit a parabola to a bunch of data points $(x_i, y_i)$, we use the model $y = c_0 + c_1 x + c_2 x^2$, and then each data point becomes a linear equation $y_i = c_0 + c_1 x_i + c_2 x_i^2$ in $(c_0, c_1, c_2)$. We wind up with $N$ equations in 3 unknowns, and we can find a least-squares solution (sketched in code after this list).
  3. Exponential fits: To fit an exponential $y = c e^{kx}$ to data, we first take logs to get $\ln(y) = \ln(c) + k x$. Every data point gives a linear equation $\ln(y_i) = \ln(c) + k x_i$ in the variables $(\ln(c), k)$, and we use ordinary least squares with these equations to get the "best" possible values of $(\ln(c),k)$. At the end, we have to exponentiate $\ln(c)$ to get $c$. This is equivalent to plotting the points $(x_i,y_i)$ on semi-log paper and finding the best line (also sketched in code after the list).
  4. Power laws: To fit a power law $y= c x^p$, we take logs to get $\ln(y) = \ln(c) + p \ln(x)$. Now we have a linear relationship between $\ln(y)$ and $\ln(x)$ with slope $p$, which we can investigate with ordinary least-squares. This is equivalent to plotting data on a log-log graph and finding the best line.
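
All four of these fits use the same machinery: build a matrix whose columns are the model's basis functions (transforming the data first, if needed) and solve the resulting least-squares problem. Here is a sketch in Python with NumPy of the quadratic fit and the exponential fit; the data is made up so that $y \approx e^x$.

```python
import numpy as np

# Made-up data, chosen so that y is roughly e^x.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.7, 7.4, 20.1, 54.6, 148.4, 403.4])

# Quadratic fit y = c0 + c1 x + c2 x^2: columns 1, x, x^2.
A_quad = np.column_stack([np.ones_like(x), x, x**2])
c_quad, *_ = np.linalg.lstsq(A_quad, y, rcond=None)

# Exponential fit y = c e^{kx}: take logs, fit a line, exponentiate.
A_exp = np.column_stack([np.ones_like(x), x])
log_c, k = np.linalg.lstsq(A_exp, np.log(y), rcond=None)[0]
c = np.exp(log_c)

print(c_quad)  # coefficients (c0, c1, c2) of the best parabola
print(c, k)    # should be close to c = 1, k = 1 for this data
```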