3.2. One Dimension

From the last section, our problem is as follows:

Given data points \(\{(x_i, y_i)\}_{i=1}^N\) such that \(Y = \theta X + \epsilon\) where \(\epsilon \sim \mathcal{N}(0,\sigma^2)\), determine the value of \(\theta\) that maximizes the probability of guessing the correct value of \(Y\) given \(X\).
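To make this concrete, here is a minimal sketch of what such a dataset might look like. The true slope of 2.0 and noise standard deviation of 0.5 are hypothetical values chosen only for illustration; they do not come from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

theta_true = 2.0   # hypothetical true slope (not specified in the text)
sigma = 0.5        # hypothetical noise standard deviation

N = 50
x = rng.uniform(-3, 3, size=N)          # N sample inputs
epsilon = rng.normal(0, sigma, size=N)  # Gaussian errors with mean 0 and variance sigma^2
y = theta_true * x + epsilon            # observed targets follow Y = theta X + epsilon
```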

Earlier, we learned two techniques for solving these “maximization of probability” problems, namely MLE and MAP. We can reduce the linear regression problem to choosing the most likely value of \(\theta\) given the data \(\{(x_i, y_i)\}_{i=1}^N\) we observed. This can be interpreted as choosing the most likely linear relationship in the data. Formally, we seek \(\arg \max_\theta P(\theta | \{(x_i, y_i)\}_{i=1}^N)\). For now, let’s assume that we don’t have any prior knowledge about the likelihood of different values of \(\theta\). As such, we will use MLE to derive the optimal value of \(\theta\).

3.2.1. MLE For Linear Regression

The MLE problem can be written as follows: \(\theta^* = \arg \max_\theta P(\{(x_i, y_i)\}_{i=1}^N | \theta)\). You might initially be at a loss as to how to reduce this expression, but we only need to remember a few crucial observations from before.

  1. Each data point \((x_i, y_i)\) is drawn independently from \(p_{data}\).

  2. By our model of the errors, \(y_i = \theta x_i + \epsilon_i\), where \(\epsilon_i \sim \mathcal{N}(0, \sigma^2)\). Thus, it follows that \(y_i - \theta x_i \sim \mathcal{N}(0, \sigma^2)\). This actually gives the probability of observing a given data point according to the error model.

We’ll use both of these facts to reduce the problem. Let’s look at the probability term \(P(\{(x_i, y_i)\}_{i=1}^N | \theta)\).

\[P(\{(x_i, y_i)\}_{i=1}^N | \theta) = \prod_{i=1}^N P( (x_i, y_i) | \theta) = \prod_{i=1}^N f_\epsilon(y_i - \theta x_i) = \prod_{i=1}^N \frac{1}{\sqrt{2\pi} \sigma} \exp\left(-\frac{(y_i - \theta x_i)^2}{2\sigma^2}\right)\]

In the first step, we apply the independence of the samples. In the second step, we apply the Gaussian PDF of the error model. The resulting product of exponentials can be a little unwieldy. A common trick to deal with this is to take the log of the objective function, which we can do because the log is a monotonically increasing function and therefore does not change the location of the maximum.

\[\begin{split}\begin{align*} \arg \max_\theta \prod_{i=1}^N \frac{1}{\sqrt{2\pi} \sigma} \exp\left(-\frac{(y_i - \theta x_i)^2}{2\sigma^2}\right) &= \arg \max_\theta \ln \left( \prod_{i=1}^N \frac{1}{\sqrt{2\pi} \sigma} \exp\left(-\frac{(y_i - \theta x_i)^2}{2\sigma^2}\right) \right) \\ &= \arg \max_\theta \sum_{i=1}^N \ln \left(\frac{1}{\sqrt{2\pi} \sigma}\right) -\frac{(y_i - \theta x_i)^2}{2\sigma^2} \\ &= \arg \max_\theta N \ln \left(\frac{1}{\sqrt{2\pi} \sigma}\right) - \frac{1}{2\sigma^2} \sum_{i=1}^N (y_i - \theta x_i)^2 \\ &= \arg \max_\theta - \frac{1}{2\sigma^2} \sum_{i=1}^N (y_i - \theta x_i)^2 \\ &= \arg \max_\theta - \sum_{i=1}^N (y_i - \theta x_i)^2 \\ &= \arg \min_\theta \sum_{i=1}^N (y_i - \theta x_i)^2 \\ \end{align*}\end{split}\]

In this section of math, we start by transforming the objective with a natural log, gradually simplify terms using log properties, and finally remove terms that do not directly affect the result of the optimization. Note that the variance of the errors drops out of the objective entirely!

In the end, we get a remarkable yet intuitive result: maximizing the probability of the data is equivalent to minimizing the squared errors of our model, \(f(x) = \theta x\).
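As a sanity check, the sketch below (reusing the hypothetical data-generating setup from earlier) evaluates both the Gaussian log-likelihood and the sum of squared errors on a grid of candidate values of \(\theta\); the same candidate should win both contests.

```python
import numpy as np

rng = np.random.default_rng(0)
theta_true, sigma, N = 2.0, 0.5, 50           # hypothetical values, as before
x = rng.uniform(-3, 3, size=N)
y = theta_true * x + rng.normal(0, sigma, size=N)

thetas = np.linspace(0.0, 4.0, 401)           # grid of candidate slopes

def log_likelihood(theta):
    # sum_i log N(y_i - theta * x_i; 0, sigma^2)
    resid = y - theta * x
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2) - resid**2 / (2 * sigma**2))

def sum_squared_errors(theta):
    return np.sum((y - theta * x) ** 2)

best_by_likelihood = thetas[np.argmax([log_likelihood(t) for t in thetas])]
best_by_squared_error = thetas[np.argmin([sum_squared_errors(t) for t in thetas])]
print(best_by_likelihood, best_by_squared_error)  # identical up to the grid resolution
```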

3.2.2. Loss Functions

We have also just derived our first loss function. A loss function \(L(f(x),y)\) measures how close a guess \(\hat{y} = f(x)\) is to the actual value \(y\). Note that loss functions can always be applied one data point at a time. The loss function \(L(f(x),y) = (y - f(x))^2\) is referred to as squared error and follows from the assumption that errors are Gaussian distributed with mean zero.
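In code, the squared error loss for a single prediction is a one-liner (a sketch; the function name is a hypothetical choice):

```python
def squared_error(y_hat, y):
    """Squared error loss L(f(x), y) = (y - f(x))^2 for a single data point."""
    return (y - y_hat) ** 2
```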

In machine learning, we often write our objective functions as cost functions (denoted \(J\)), equal to a sum of loss functions.

\[ \arg \min_\theta J(\theta) = \sum_{i=1}^N L(f(x_i), y_i)\]

We’ll use this formulation for the rest of this section (and the book). As you can see, the linear regression problem exactly follows this form.
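Putting the pieces together, a sketch of the linear regression cost function in this form might look as follows (the function names and the tiny dataset are hypothetical):

```python
import numpy as np

def f(x, theta):
    """Our one-dimensional linear model f(x) = theta * x."""
    return theta * x

def squared_error(y_hat, y):
    """Per-point loss L(f(x), y) = (y - f(x))^2."""
    return (y - y_hat) ** 2

def cost(theta, x, y):
    """Cost J(theta): the sum of per-point losses over the dataset."""
    return np.sum(squared_error(f(x, theta), y))

# Example usage on a tiny hypothetical dataset.
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.1, 3.9, 6.2])
print(cost(2.0, x, y))
```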

3.2.3. Deriving the Solution

We have two ways of arriving at the solution: we could use gradient descent to optimize the objective, or we could solve for it algebraically (a sketch of the gradient descent route appears after the derivation below). Linear regression yields a rare and special type of optimization problem, as it is unconstrained, convex, and has a closed-form solution. Because the objective function is convex, any local minimum is a global minimum, and thus every critical point is an optimum. Here we’ll solve for the optimal solution by taking the derivative of the cost function and setting it to zero.

\[\frac{\partial J}{\partial \theta} = \sum_{i=1}^N 2(y_i - \theta x_i) (-x_i) = 2\theta \sum_{i=1}^N x_i^2 - 2 \sum_{i=1}^N x_i y_i = 0\]

Thus, we arrive at the conclusion that \(\theta^* = \frac{\sum_{i=1}^N x_i y_i}{\sum_{i=1}^N x_i^2}\).
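In code, the closed-form estimator is a one-liner. The sketch below (with hypothetical data) also runs the gradient descent alternative mentioned above for comparison; the step size and iteration count are arbitrary choices, not prescribed by the text.

```python
import numpy as np

# Hypothetical data; any dataset {(x_i, y_i)} works here.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.1])

# Closed-form solution: theta* = sum(x_i y_i) / sum(x_i^2).
theta_star = np.sum(x * y) / np.sum(x ** 2)

# Gradient descent on J(theta) reaches the same answer.
theta = 0.0
learning_rate = 0.01
for _ in range(1000):
    grad = np.sum(2 * (theta * x - y) * x)   # dJ/dtheta from the derivation above
    theta -= learning_rate * grad

print(theta_star, theta)  # the two estimates should agree closely
```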

3.2.4. Generalized Linear Models (Bonus)

We spent a lot of time in this section justifying why we model the errors \(\epsilon\) as normally distributed. Under the assumptions we made, Gaussian error models tend to perform well, but sometimes errors follow a different distribution. Modeling errors with an arbitrary (exponential family) probability distribution yields what’s called a generalized linear model.
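As a pointer to how this looks in practice, the sketch below fits the same no-intercept linear model with the GLM interface from the statsmodels library, using a Gaussian family; under that choice the fit should recover the least-squares solution from this section. The data are hypothetical, and statsmodels is just one of several libraries that implement GLMs.

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical data; a Gaussian-family GLM reduces to ordinary least squares.
x = np.array([1.0, 2.0, 3.0, 4.0]).reshape(-1, 1)
y = np.array([2.1, 3.9, 6.2, 8.1])

model = sm.GLM(y, x, family=sm.families.Gaussian())
result = model.fit()
print(result.params)  # should match sum(x_i y_i) / sum(x_i^2) for this no-intercept model
```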