3.5. Regression Variants

We made many modeling assumptions when we first introduced linear regression. In this section, we'll revisit some of those fundamental assumptions, change them, and observe how the resulting model changes.

3.5.1. Ridge Regression

When we first approached regression, we used Maximum Likelihood Estimation (MLE). Implicitly, this means we assumed that every choice of weights \(\theta\) was equally likely. In practice, that might not be a good assumption to make. For example, imagine that your data comprises features that all have values in the range \([-10, 10]\). It then seems extremely unlikely that one of the optimal weight values would be a large value like 10,000, yet MLE treats that model as just as likely a priori as one with a weight value of 1. We can encode this intuition into a prior over the possible weight values.

Ridge regression places a Gaussian prior on the weights. We assume that \(\theta_j \sim \mathcal{N}(0, \sigma^2_\theta)\), i.e., each weight is a random variable distributed according to a normal distribution with mean zero and variance \(\sigma^2_\theta\). We can then apply Maximum A Posteriori (MAP) estimation.

\[\theta^* = \arg \max_\theta P(\{(x_i, y_i)\}_{i=1}^N | \theta) P(\theta) = \prod_{i=1}^N f_\epsilon(y_i - \theta^\top x_i) \prod_{j=1}^k f_\theta(\theta_j)\]

By assuming each component of the weight vector is independent, we can multiply together the probabilities of the individual components to get the prior probability of the entire weight vector. For the likelihood term we use the same Gaussian noise model as before. As before, we now take the log of the objective and simplify.

\[\begin{split}\begin{align*} &= \arg \max_\theta \prod_{i=1}^N \frac{1}{\sqrt{2\pi} \sigma} \exp\left(-\frac{(y_i - \theta^\top x_i)^2}{2\sigma^2}\right) \prod_{j=1}^k \frac{1}{\sqrt{2\pi} \sigma_\theta} \exp\left(-\frac{\theta_j^2}{2\sigma_\theta^2}\right) \\ &= \arg \max_\theta \ln \left( \prod_{i=1}^N \frac{1}{\sqrt{2\pi} \sigma} \exp\left(-\frac{(y_i - \theta^\top x_i)^2}{2\sigma^2}\right) \prod_{j=1}^k \frac{1}{\sqrt{2\pi} \sigma_\theta} \exp\left(-\frac{\theta_j^2}{2\sigma_\theta^2}\right) \right) \\ &= \arg \max_\theta \sum_{i=1}^N \ln \left(\frac{1}{\sqrt{2\pi} \sigma}\right) -\frac{(y_i - \theta^\top x_i)^2}{2\sigma^2} + \sum_{j=1}^k \ln \left(\frac{1}{\sqrt{2\pi} \sigma_\theta}\right) -\frac{\theta_j^2}{2\sigma_\theta^2} \\ &= \arg \max_\theta - \frac{1}{2\sigma^2} \sum_{i=1}^N (y_i - \theta^\top x_i)^2 - \frac{1}{2\sigma_\theta^2} \sum_{j=1}^k \theta_j^2 \\ &= \arg \max_\theta - \sum_{i=1}^N (y_i - \theta^\top x_i)^2 - \frac{\sigma^2}{\sigma_\theta^2} \sum_{j=1}^k \theta_j^2 \\ &= \arg \min_\theta \sum_{i=1}^N (y_i - \theta^\top x_i)^2 + \frac{\sigma^2}{\sigma_\theta^2} \sum_{j=1}^k \theta_j^2 \\ &= \arg \min_\theta ||y - X\theta||_2^2 + \lambda ||\theta||_2^2 \end{align*}\end{split}\]

Here \(\lambda = \frac{\sigma^2}{\sigma_\theta^2}\) is the ratio of the noise variance to the prior variance of the weights (equivalently, the square of the ratio of the corresponding standard deviations). For example, if the noise has standard deviation \(\sigma = 1\) while the prior allows weights with standard deviation \(\sigma_\theta = 10\), then \(\lambda = 0.01\); typically, \(\lambda\) is a value less than one. As \(\lambda\) increases, the weights are pulled toward zero. As \(\lambda\) decreases, we approach the original least-squares solution.

We can once again find the optimal weights via calculus.

\[\begin{split}\begin{align} \nabla_\theta \left( ||X\theta - y||_2^2 + \lambda ||\theta||_2^2 \right) &= \nabla_\theta \left( (X\theta - y)^\top (X\theta - y) + \lambda \theta^\top \theta \right) = \nabla_\theta (\theta^\top X^\top X \theta - \theta^\top X^\top y - y^\top X\theta + y^\top y + \lambda \theta^\top \theta) \\ &= \nabla_\theta (\theta^\top X^\top X \theta - 2 y^\top X \theta + y^\top y + \lambda \theta^\top \theta) \\ &= 2 X^\top X \theta - 2 X^\top y + 2\lambda \theta = 0 \\ &\implies X^\top X \theta + \lambda\theta = (X^\top X + \lambda I) \theta = X^\top y \\ &\implies \theta^* = (X^\top X + \lambda I)^{-1} X^\top y \end{align}\end{split}\]

Note that once we add a positive value \(\lambda\) to the diagonal of \(X^\top X\), the matrix \(X^\top X + \lambda I\) is necessarily invertible (though we won't go into why).
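To make the closed form concrete, here is a minimal NumPy sketch of the ridge solution; the function name `ridge_closed_form`, the synthetic data, and the choices of \(\lambda\) are illustrative assumptions, not part of the derivation above.

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Solve (X^T X + lam * I) theta = X^T y for theta."""
    k = X.shape[1]
    # Solving the linear system is preferred over forming the inverse explicitly.
    return np.linalg.solve(X.T @ X + lam * np.eye(k), X.T @ y)

# Illustrative synthetic data: y = 3*x_1 - 2*x_2 + Gaussian noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = X @ np.array([3.0, -2.0]) + rng.normal(scale=0.5, size=100)

print(ridge_closed_form(X, y, lam=0.1))    # close to the ordinary least-squares fit
print(ridge_closed_form(X, y, lam=100.0))  # larger lambda shrinks the weights toward zero
```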

3.5.2. Lasso Regression

The Gaussian distribution was just one choice of prior. If we think the weights are especially likely to be near zero, we can choose another prior distribution. Lasso regression is the variant in which we assume the weights are distributed according to a Laplace distribution, which is essentially a two-sided exponential. This means that the most probable weight value is zero, capturing the belief that many of the features in our dataset don't actually contribute to the output.

\[\theta_j \sim \frac{1}{2b} \exp\left(-\frac{|\theta_j|}{b}\right)\]

We then apply MAP again.

\[\begin{split}\begin{align*} &= \arg \max_\theta \prod_{i=1}^N \frac{1}{\sqrt{2\pi} \sigma} \exp\left(-\frac{(y_i - \theta^\top x_i)^2}{2\sigma^2}\right) \prod_{j=1}^k \frac{1}{2b} \exp\left(-\frac{|\theta_j|}{b}\right) \\ &= \arg \max_\theta \ln \left( \prod_{i=1}^N \frac{1}{\sqrt{2\pi} \sigma} \exp\left(-\frac{(y_i - \theta^\top x_i)^2}{2\sigma^2}\right) \prod_{j=1}^k \frac{1}{2b} \exp\left(-\frac{|\theta_j|}{b}\right) \right) \\ &= \arg \max_\theta \sum_{i=1}^N \ln \left(\frac{1}{\sqrt{2\pi} \sigma}\right) -\frac{(y_i - \theta^\top x_i)^2}{2\sigma^2} + \sum_{j=1}^k \ln \left(\frac{1}{2b}\right) -\frac{|\theta_j|}{b} \\ &= \arg \max_\theta - \frac{1}{2\sigma^2} \sum_{i=1}^N (y_i - \theta^\top x_i)^2 - \frac{1}{b} \sum_{j=1}^k |\theta_j| \\ &= \arg \min_\theta \sum_{i=1}^N (y_i - \theta^\top x_i)^2 + \frac{2\sigma^2}{b} \sum_{j=1}^k |\theta_j| \\ &= \arg \min_\theta ||y - X\theta||_2^2 + \lambda ||\theta||_1 \end{align*}\end{split}\]

We have introduced a new norm here, the \(\ell_1\) norm, which is equal to the sum of the absolute values of the components of a vector; here \(\lambda = \frac{2\sigma^2}{b}\). Though lasso regression is still a convex problem, it doesn't have a nice closed-form solution like ridge regression, so we leave the problem to gradient descent or an equivalent optimizer such as coordinate descent. In practice, lasso regression tends to set many weights exactly to zero, as the prior can be interpreted as making each weight value exponentially less likely the further it is from 0. Thus, the solution to lasso regression is often said to be sparse.
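To see this sparsity in practice, here is a short sketch using scikit-learn's Lasso, which minimizes this objective with coordinate descent; its `alpha` parameter plays the role of \(\lambda\), up to scikit-learn's \(\frac{1}{2N}\) scaling of the squared-error term. The synthetic data, in which only two of ten features actually matter, is purely illustrative.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Illustrative data: only features 0 and 3 actually affect y.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
true_theta = np.zeros(10)
true_theta[0], true_theta[3] = 2.0, -1.5
y = X @ true_theta + rng.normal(scale=0.1, size=200)

# alpha stands in for lambda (up to scikit-learn's 1/(2N) scaling of the squared error).
model = Lasso(alpha=0.1).fit(X, y)
print(model.coef_)  # most coefficients are driven exactly to zero
```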

3.5.3. Weighted Regression

Before, we challenged assumptions about the prior over the different weight vectors. Now, we'll challenge assumptions we made about the data. In particular, we assumed that for each data point \((x_i, y_i)\) the associated error \(\epsilon_i\) was independent and identically distributed. We will continue to assume that the errors are independent and Gaussian with mean zero, but we will allow them to have different variances. Specifically, each error \(\epsilon_i \sim \mathcal{N}(0, \sigma_i^2)\). This essentially means that we “trust” some data points more than others, believing them to be more reliable samples or estimations of the true relationship. Let's apply MLE to this new scenario and see what happens.

\[\begin{split}\begin{align*} & \arg \max_\theta P(\{(x_i, y_i)\}_{i=1}^N | \theta) = \prod_{i=1}^N P( (x_i, y_i) | \theta) = \prod_{i=1}^N f_{\epsilon_i}(y_i - \theta^\top x_i) \\ &= \arg \max_\theta \prod_{i=1}^N \frac{1}{\sqrt{2\pi} \sigma_i} \exp\left(-\frac{(y_i - \theta^\top x_i)^2}{2\sigma_i^2}\right) \\ &= \arg \max_\theta \ln \left( \prod_{i=1}^N \frac{1}{\sqrt{2\pi} \sigma_i} \exp\left(-\frac{(y_i - \theta^\top x_i)^2}{2\sigma_i^2}\right) \right) \\ &= \arg \max_\theta \sum_{i=1}^N \ln \left(\frac{1}{\sqrt{2\pi} \sigma_i}\right) -\frac{(y_i - \theta^\top x_i)^2}{2\sigma_i^2} \\ &= \arg \max_\theta \sum_{i=1}^N \ln \left(\frac{1}{\sqrt{2\pi} \sigma_i}\right) - \sum_{i=1}^N \frac{1}{2\sigma_i^2} (y_i - \theta^\top x_i)^2 \\ &= \arg \min_\theta \sum_{i=1}^N \frac{1}{\sigma_i^2}(y_i - \theta^\top x_i)^2 \end{align*}\end{split}\]

Here we observe that we get a new weighted objective where the “importance” of a sample is inversely proportional to its variance, so a lower-variance sample contributes more to the total cost function. We can write down the objective in matrix form as follows:

\[\begin{split}\arg \min_\theta (y-X\theta)^\top \Omega (y-X\theta), \quad \Omega = \begin{bmatrix} 1/\sigma_1^2 & & 0 \\ & \ddots & \\ 0 & & 1/\sigma_N^2 \end{bmatrix}\end{split}\]

The closed-form solution is \(\theta^* = (X^\top \Omega X)^{-1} X^\top \Omega y\). In practice, we choose the weight of each data point ourselves, since we typically don't know each \(\sigma_i\).
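As a rough sketch of how this closed form might be used (assuming, for illustration, that we do know or can estimate each \(\sigma_i\)), the following NumPy snippet builds \(\Omega\) from weights \(1/\sigma_i^2\) and solves for \(\theta^*\); the data and the split into clean versus noisy samples are made up for the example.

```python
import numpy as np

def weighted_least_squares(X, y, weights):
    """Solve (X^T Omega X) theta = X^T Omega y with Omega = diag(weights)."""
    Omega = np.diag(weights)
    return np.linalg.solve(X.T @ Omega @ X, X.T @ Omega @ y)

# Illustrative data: the second half of the samples is much noisier,
# so those points get smaller weights (weights = 1 / sigma_i^2).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
sigmas = np.concatenate([np.full(50, 0.1), np.full(50, 2.0)])
y = X @ np.array([1.0, -1.0]) + rng.normal(scale=sigmas)

print(weighted_least_squares(X, y, weights=1.0 / sigmas**2))
```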