
From Bayesian Priors to Weight Decay

AI & ML Apr 5, 2024

Regularization is a cornerstone of modern machine learning, preventing overfitting by penalizing large weights in a model. Two of the most common forms are $\text{L}_2$ regularization (weight decay) and $\text{L}_1$ regularization (sparsity-inducing penalty), often introduced as ad-hoc modifications to the loss function.

However, these techniques emerge naturally from a Bayesian perspective, where we impose prior distributions on the model weights and derive their effects through Maximum a Posteriori (MAP) estimation.

In this post, we’ll derive weight decay ($\text{L}_2$ regularization) by assuming a Gaussian prior over the weights and show how it leads to the familiar $\text{L}_2$ penalty.

Bayesian Framework and MAP Estimation

In supervised learning, we typically seek the weights $\mathbf{w}$ of a model that minimize the negative log-likelihood (NLL) of the observed data:

$$\mathcal{L}(\mathbf{w}) = -\log p(\mathbf{y} | \mathbf{X}, \mathbf{w})$$

This is the maximum likelihood estimation (MLE) approach, which finds the weights that make the observed data most probable. However, MLE can lead to overfitting, especially when the model is complex relative to the data size.
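To make the NLL concrete: if we assume, for example, i.i.d. Gaussian observation noise with variance $\sigma_n^2$ and write $f_{\mathbf{w}}(\mathbf{x}_i)$ for the model's prediction (notation introduced here purely for illustration), the NLL reduces to a sum of squared errors:

$$-\log p(\mathbf{y} | \mathbf{X}, \mathbf{w}) = \frac{1}{2 \sigma_n^2} \sum_{i=1}^n \left( y_i - f_{\mathbf{w}}(\mathbf{x}_i) \right)^2 + \text{const.}$$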

A Bayesian alternative is to treat the weights $\mathbf{w}$ as random variables and impose a prior distribution $p(\mathbf{w})$.

Instead of maximizing the likelihood alone, we maximize the posterior distribution $p(\mathbf{w} | \mathbf{D})$, where $\mathbf{D} = (\mathbf{X}, \mathbf{y})$ denotes the observed data. By Bayes’ rule,

$$p(\mathbf{w} | \mathbf{D}) = \frac{p(\mathbf{D} | \mathbf{w}) \, p(\mathbf{w})}{p(\mathbf{D})}$$

Since $p(\mathbf{D})$ is constant with respect to $\mathbf{w}$, we can drop it and maximize the unnormalized posterior:

$$p(\mathbf{w} | \mathbf{D}) \propto p(\mathbf{D} | \mathbf{w}) \, p(\mathbf{w})$$

Taking the negative logarithm (which turns the product into a sum and the maximization into a minimization), we obtain the MAP objective:

$$\mathbf{w}_{\text{MAP}} = \arg \min_{\mathbf{w}} \left[ -\log p(\mathbf{D} | \mathbf{w}) - \log p(\mathbf{w}) \right]$$

Here, the first term is the negative log-likelihood (same as in MLE), and the second term is the negative log-prior, which acts as a regularizer.
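Viewed as code, the MAP objective is simply the training loss plus a penalty term contributed by the prior. A minimal sketch (the function and argument names are placeholders of my own, not from any particular library):

```python
from typing import Callable
import numpy as np

def map_objective(
    w: np.ndarray,
    nll: Callable[[np.ndarray], float],
    neg_log_prior: Callable[[np.ndarray], float],
) -> float:
    """Negative log-posterior up to a constant: data NLL plus negative log-prior."""
    return nll(w) + neg_log_prior(w)
```

Different choices of `neg_log_prior` give different regularizers, as the next two sections show.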

Deriving $\text{L}_2$ Regularization (Weight Decay) from a Gaussian Prior

Suppose we assume that the weights $\mathbf{w}$ follow a zero-mean Gaussian prior:

$$p(\mathbf{w}) = \mathcal{N}(\mathbf{w}; 0, \sigma^2 \mathbf{I}) = \prod_{i=1}^d \frac{1}{\sqrt{2 \pi \sigma^2}} \exp \left( -\frac{\mathbf{w}_i^2}{2 \sigma^2} \right)$$

where $\sigma^2$ is the variance (controlling the strength of the prior), and $\mathbf{I}$ is the identity matrix (assuming independence between weights).

The negative log-prior is then:

$$-\log p(\mathbf{w}) = \frac{1}{2 \sigma^2} \|\mathbf{w}\|_2^2 + \text{const.}$$
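Spelling this out term by term shows where the constant comes from: it simply collects the Gaussian normalization terms, which do not depend on $\mathbf{w}$:

$$-\log p(\mathbf{w}) = \sum_{i=1}^d \left[ \frac{\mathbf{w}_i^2}{2 \sigma^2} + \frac{1}{2} \log \left( 2 \pi \sigma^2 \right) \right] = \frac{1}{2 \sigma^2} \|\mathbf{w}\|_2^2 + \frac{d}{2} \log \left( 2 \pi \sigma^2 \right)$$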

Substituting this into the MAP objective, we get

$$\mathbf{w}_{\text{MAP}} = \arg \min_{\mathbf{w}} \left[ -\log p(\mathbf{D} | \mathbf{w}) + \frac{1}{2 \sigma^2} \|\mathbf{w}\|_2^2 \right]$$

This is equivalent to the standard $\text{L}_2$-regularized loss, where the regularization strength $\lambda$ is inversely proportional to the prior variance $\sigma^2$:

$$\lambda = \frac{1}{2 \sigma^2}$$

Thus, weight decay is a MAP estimate under a Gaussian prior.
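As a quick sanity check, here is a minimal numpy sketch under the assumption of a linear model with unit observation-noise variance, so that the NLL is $\frac{1}{2} \|\mathbf{y} - \mathbf{X} \mathbf{w}\|_2^2$. Gradient descent on the MAP objective multiplies the weights by $(1 - 2 \eta \lambda)$ at every step (hence the name weight decay) and converges to the closed-form ridge solution:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 5
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

sigma2 = 1.0              # variance of the Gaussian prior
lam = 1.0 / (2 * sigma2)  # lambda = 1 / (2 sigma^2)

# Closed-form minimizer of 0.5 * ||y - Xw||^2 + lam * ||w||^2
w_closed = np.linalg.solve(X.T @ X + 2 * lam * np.eye(d), X.T @ y)

# Gradient descent on the same objective; the "+ 2 * lam * w" term is weight decay
w, eta = np.zeros(d), 1e-3
for _ in range(20_000):
    grad = -X.T @ (y - X @ w) + 2 * lam * w
    w -= eta * grad

print(np.allclose(w, w_closed))  # should print True
```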

Deriving $\text{L}_1$ Regularization from a Laplace Prior

While a Gaussian prior leads to $\text{L}_2$ regularization, a Laplace prior instead produces $\text{L}_1$ regularization, which encourages sparsity in the weights.

Assume instead that each weight follows an independent, zero-mean Laplace prior:

$$p(\mathbf{w}) = \prod_{i=1}^d \frac{1}{2b} \exp \left( -\frac{|\mathbf{w}_i|}{b} \right)$$

where $b$ is the scale parameter. The negative log-prior is:

$$-\log p(\mathbf{w}) = \frac{1}{b} \|\mathbf{w}\|_1 + \text{const.}$$

Substituting into the MAP objective:

$$\mathbf{w}_{\text{MAP}} = \arg \min_{\mathbf{w}} \left[ -\log p(\mathbf{D} | \mathbf{w}) + \frac{1}{b} \|\mathbf{w}\|_1 \right]$$

This is the $\text{L}_1$-regularized loss, where the regularization strength is $\lambda = \frac{1}{b}$.
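In practice, the difference between the two priors is easy to see by fitting both penalties on the same synthetic data. A minimal sketch, assuming scikit-learn is available (its Lasso and Ridge estimators implement the $\text{L}_1$- and $\text{L}_2$-regularized least-squares objectives, each with its own scaling convention for the regularization strength alpha):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
n, d = 100, 20
X = rng.normal(size=(n, d))
w_true = np.zeros(d)
w_true[:3] = [2.0, -1.5, 1.0]          # only 3 of 20 features are informative
y = X @ w_true + 0.1 * rng.normal(size=n)

lasso = Lasso(alpha=0.1).fit(X, y)     # L1 penalty  ~ Laplace prior
ridge = Ridge(alpha=0.1).fit(X, y)     # L2 penalty  ~ Gaussian prior

print(np.sum(lasso.coef_ == 0.0))      # many coefficients end up exactly zero
print(np.sum(ridge.coef_ == 0.0))      # typically 0: ridge only shrinks them
```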

Why Does $\text{L}_1$ Encourage Sparsity?

The Laplace prior has a much sharper peak at zero than the Gaussian, and the resulting $\text{L}_1$ penalty keeps a constant (sub)gradient magnitude no matter how small a weight gets, whereas the $\text{L}_2$ penalty's gradient vanishes near zero. The optimizer can therefore push many weights to exactly zero instead of merely shrinking them, which makes $\text{L}_1$ regularization useful for feature selection in high-dimensional settings.
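One way to see this concretely is to compare the proximal operators of the two penalties on a scalar $v$: the $\text{L}_1$ prox is soft-thresholding, which returns exactly zero whenever $|v| \le \lambda$, while the $\text{L}_2$ prox only rescales $v$ towards zero. A minimal sketch (function names are my own, not from any library):

```python
import numpy as np

def prox_l1(v, lam):
    """Soft-thresholding: argmin_w 0.5 * (w - v)^2 + lam * |w|."""
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def prox_l2(v, lam):
    """Shrinkage: argmin_w 0.5 * (w - v)^2 + lam * w^2."""
    return v / (1.0 + 2.0 * lam)

v = np.array([-2.0, -0.3, 0.1, 0.8, 3.0])
print(prox_l1(v, lam=0.5))  # small entries become exactly zero
print(prox_l2(v, lam=0.5))  # every entry is merely scaled towards zero
```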

Conclusion

We've seen that regularization is not just an ad-hoc trick; rather, it arises naturally from Bayesian reasoning. By imposing a Gaussian prior on the weights, we derive $\text{L}_2$ regularization (weight decay), while a Laplace prior leads to $\text{L}_1$ regularization.

Understanding this derivation not only demystifies regularization but also opens the door to more sophisticated priors (e.g., hierarchical Bayes, automatic relevance determination) that can further improve model generalization.

