Relating Bayesian Inference, Empirical Risk Minimization and Maximum Likelihood Estimation

Statistical learning and machine learning are built upon a few foundational principles that govern how we infer patterns from data. Among these, Bayes' Theorem, Empirical Risk Minimization (ERM) and Maximum Likelihood Estimation (MLE) stand out as the most influential frameworks.

While they differ in their philosophical underpinnings and mathematical formulations, they are deeply interconnected: under specific conditions, one framework can be recovered as a special case of another.

In this blog post, we will not dive deep into each of the three concepts. Instead, we want to establish a solid bird's-eye view of all three frameworks and relate them to each other to highlight their core similarities and differences.

Bayesian Inference

At the heart of probabilistic reasoning lies Bayes' Theorem, a mathematical formulation that describes how to update our beliefs in light of new evidence.

Unlike frequentist methods, which treat parameters as fixed but unknown quantities, Bayesian inference treats parameters as random variables with probability distributions. This perspective allows us to quantify uncertainty and incorporate prior knowledge into our models.

Bayes' Theorem provides a way to compute the posterior distribution $p(\theta | X)$ of a parameter $\theta$ given observed data $X$.

$$p(\theta | X) = \frac{p(X | \theta) \cdot p(\theta)}{p(X)}$$

where each term has a distinct interpretation:

$p(\theta | X)$: The posterior distribution, representing our updated belief about the parameter $\theta$ after observing the data $X$.
$p(X | \theta)$: The likelihood, which measures how probable the observed data $X$ is under a given parameter $\theta$.
$p(\theta)$: The prior distribution, encoding our initial beliefs about $\theta$ before seeing any data.
$p(X)$: The evidence, which acts as a normalizing constant to ensure the posterior is a valid probability distribution. It is defined as $p(X) = \int p(X | \theta) \cdot p(\theta) \, d\theta$.

While $p(X)$ is often intractable to compute directly, it is not needed for many inference tasks, such as finding the maximum a posteriori (MAP) estimate.
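To make the roles of the four terms concrete, here is a minimal sketch that evaluates Bayes' Theorem numerically on a grid. It assumes a Bernoulli (coin-flip) likelihood and a uniform prior, neither of which comes from the text above; the function name `grid_posterior` and the toy data are purely illustrative.

```python
def grid_posterior(data, prior, grid):
    """Approximate the posterior density of a Bernoulli parameter on a grid.

    Assumption: the data are i.i.d. 0/1 outcomes and `prior` is a function
    returning the (possibly unnormalized) prior density at theta.
    """
    k, n = sum(data), len(data)
    width = grid[1] - grid[0]
    # Numerator of Bayes' Theorem: likelihood times prior at each grid point.
    unnormalized = [(t ** k) * ((1 - t) ** (n - k)) * prior(t) for t in grid]
    # Evidence p(X): the integral of the numerator, here a midpoint Riemann sum.
    evidence = sum(unnormalized) * width
    posterior = [u / evidence for u in unnormalized]
    return posterior, evidence


grid = [(i + 0.5) / 1000 for i in range(1000)]  # midpoints of 1000 bins on (0, 1)
width = grid[1] - grid[0]
data = [1, 1, 0, 1, 0, 1, 1, 1]  # hypothetical coin flips: 6 successes out of 8

posterior, evidence = grid_posterior(data, lambda t: 1.0, grid)
posterior_mean = sum(t * p for t, p in zip(grid, posterior)) * width
print(f"evidence ~ {evidence:.6f}, posterior mean ~ {posterior_mean:.4f}")
```

On a one-dimensional grid the evidence is cheap to approximate. The intractability mentioned above arises when $\theta$ is high-dimensional and the integral can no longer be evaluated this way.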

Maximum A Posteriori (MAP) Estimate

The Maximum A Posteriori (MAP) estimate, one of the most common point estimates in Bayesian inference, selects the value of $\theta$ that maximizes the posterior distribution.

As we seek the mode of the posterior $p(\theta | X)$, we find that $p(X)$ is just a constant independent of $\theta$. Thus, we can drop $p(X)$ because it doesn’t change the location of the maximum.

Hence, the MAP formula simplifies to

$$\hat{\theta}_{\text{MAP}} = \arg\max_{\theta} p(\theta | X) = \arg\max_{\theta} p(X | \theta) \cdot p(\theta)$$

The MAP objective is now the product of the likelihood $p(X | \theta)$ and the prior $p(\theta)$.

The MAP estimate is particularly useful when we want a single "best guess" for $\theta$ while still incorporating prior knowledge.
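As a small illustration, the sketch below computes a MAP estimate for a Bernoulli parameter under a Beta prior, once by maximizing the unnormalized log-posterior on a grid and once with the known closed-form mode of the Beta posterior. The Beta(2, 2) prior, the counts and the function names are assumptions made for this example only.

```python
import math


def log_unnormalized_posterior(theta, k, n, a, b):
    """log p(X | theta) + log p(theta), dropping terms that do not depend on theta."""
    log_likelihood = k * math.log(theta) + (n - k) * math.log(1 - theta)
    log_prior = (a - 1) * math.log(theta) + (b - 1) * math.log(1 - theta)
    return log_likelihood + log_prior


def map_on_grid(k, n, a, b, steps=1000):
    """Pick the grid point in (0, 1) with the highest unnormalized log-posterior."""
    grid = [i / steps for i in range(1, steps)]
    return max(grid, key=lambda t: log_unnormalized_posterior(t, k, n, a, b))


k, n = 6, 8          # hypothetical data: 6 successes in 8 trials
a, b = 2.0, 2.0      # assumed Beta(2, 2) prior, mildly favouring theta near 0.5

theta_map_grid = map_on_grid(k, n, a, b)
theta_map_closed = (k + a - 1) / (n + a + b - 2)  # mode of Beta(k + a, n - k + b)
theta_mle = k / n

print(theta_map_grid, theta_map_closed, theta_mle)
```

Note that the evidence $p(X)$ never appears in the code. The prior pulls the estimate from the sample frequency 0.75 towards 0.5, giving 0.7.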

Similarities

Bayesian inference is the most general of the three frameworks because it subsumes both MLE and ERM under specific conditions. For instance:

If we use a uniform prior $p(\theta) \propto 1$, the MAP estimate reduces to the maximum likelihood estimate (MLE).
If we take the negative logarithm of the MAP objective, MAP estimation becomes a regularized ERM problem: the negative log-likelihood $-\log p(X | \theta)$ plays the role of the loss and the negative log-prior $-\log p(\theta)$ plays the role of the regularizer.

Empirical Risk Minimization

While Bayesian inference provides a principled way to incorporate uncertainty and prior knowledge, Empirical Risk Minimization (ERM) is the dominant paradigm in modern machine learning.

ERM is an optimization-based framework that seeks to minimize the average loss of a model over observed data.

Unlike Bayesian methods, ERM does not require probabilistic assumptions and can be applied to a wide range of loss functions, making it highly versatile.

The core idea behind ERM is to find the model parameters $\theta$ that minimize the empirical risk, which is the average loss over the training data.

Formally, given a loss function $\mathcal{L}(x_i, \theta)$ that measures the discrepancy between the model's prediction and the true value for a data point $x_i$ (in supervised problems, the true value is a target $y_i$ that accompanies $x_i$), the ERM objective is:

$$\hat{\theta}_{\text{ERM}} = \arg\min_{\theta} \left[ \frac{1}{n} \sum_{i=1}^n \mathcal{L}(x_i, \theta) \right]$$

Here, $\mathcal{L}(x_i, \theta)$ can be any loss function, such as:

Squared error loss for regression: $\mathcal{L}(x_i, \theta) = (y_i - f_\theta(x_i))^2$.
Cross-entropy loss for classification: $\mathcal{L}(x_i, \theta) = -\sum_{k=1}^K y_{i,k} \log p_{i,k}$, where $y_{i,k}$ is the one-hot encoded label and $p_{i,k}$ is the predicted probability for class $k$.
Hinge loss for support vector machines: $\mathcal{L}(x_i, \theta) = \max(0, 1 - y_i f_\theta(x_i))$, with labels $y_i \in \{-1, +1\}$.
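The three losses and the empirical risk can be sketched in a few lines of plain Python. The function names and the toy targets and predictions below are illustrative assumptions; the model $f_\theta$ is left out, and the code works directly on its outputs.

```python
import math


def squared_error(y, y_hat):
    """Squared error between a target y and a prediction y_hat."""
    return (y - y_hat) ** 2


def cross_entropy(y_onehot, probs):
    """Cross-entropy between a one-hot label and predicted class probabilities."""
    # Classes with label 0 contribute nothing, so they are skipped.
    return -sum(y * math.log(p) for y, p in zip(y_onehot, probs) if y > 0)


def hinge(y, score):
    """Hinge loss for a label y in {-1, +1} and a real-valued score."""
    return max(0.0, 1.0 - y * score)


def empirical_risk(loss, targets, predictions):
    """Average loss over the data set, i.e. the ERM objective for fixed predictions."""
    losses = [loss(t, p) for t, p in zip(targets, predictions)]
    return sum(losses) / len(losses)


risk_squared = empirical_risk(squared_error, [3.0, 1.0], [2.5, 1.0])   # 0.125
risk_hinge = empirical_risk(hinge, [1, -1], [0.3, -2.0])               # about 0.35
risk_ce = empirical_risk(cross_entropy, [[0, 1, 0]], [[0.2, 0.7, 0.1]])

print(risk_squared, risk_hinge, risk_ce)
```

Only the `loss` argument changes between the three calls, which is the sense in which ERM is agnostic to the choice of loss.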

Regularization

In practice, ERM is often augmented with regularization to prevent overfitting. Regularization adds a penalty term to the empirical risk, encouraging simpler or more stable models. For example, $\text{L}_2$ regularization adds a term proportional to the squared norm of the parameters:

$$\hat{\theta}_{\text{ERM}} = \arg\min_{\theta} \left[ \frac{1}{n} \sum_{i=1}^n \mathcal{L}(x_i, \theta) + \lambda \|\theta\|_2^2 \right]$$

When the loss is the negative log-likelihood, this regularized ERM can be interpreted as a Bayesian MAP estimate with a zero-mean Gaussian prior on $\theta$, illustrating the deep connection between ERM and Bayesian inference.
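The link to MAP estimation can be spelled out in a short derivation. As assumptions for this sketch, take the loss to be the negative log-likelihood, $\mathcal{L}(x_i, \theta) = -\log p(x_i | \theta)$, the data to be i.i.d., and the prior to be an isotropic Gaussian $p(\theta) = \mathcal{N}(0, \tau^2 I)$, where the prior variance $\tau^2$ is notation introduced here. Taking the negative logarithm of the MAP objective and dropping constants gives

$$\hat{\theta}_{\text{MAP}} = \arg\max_{\theta} \left[ \sum_{i=1}^n \log p(x_i | \theta) + \log p(\theta) \right] = \arg\min_{\theta} \left[ -\sum_{i=1}^n \log p(x_i | \theta) + \frac{1}{2\tau^2} \|\theta\|_2^2 \right]$$

Dividing the objective by $n$ does not change the minimizer, so

$$\hat{\theta}_{\text{MAP}} = \arg\min_{\theta} \left[ \frac{1}{n} \sum_{i=1}^n \mathcal{L}(x_i, \theta) + \lambda \|\theta\|_2^2 \right], \quad \lambda = \frac{1}{2 n \tau^2}$$

Under these assumptions, a narrow prior (small $\tau^2$) corresponds to strong regularization, and the effective penalty shrinks as the number of data points $n$ grows.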

ERM is the most widely used framework in machine learning for several reasons:

Flexibility: ERM can accommodate any loss function, making it applicable to a vast array of problems, from classification to reinforcement learning.
Scalability: Optimization techniques like stochastic gradient descent (SGD) make ERM highly scalable to large datasets, which is critical for modern deep learning.
No probabilistic assumptions: Unlike MLE, ERM does not require the specification of a probabilistic model, making it more robust to model misspecification.
Theoretical guarantees: Under certain conditions, ERM provides generalization bounds that relate the empirical risk to the true risk (e.g., via VC dimension or Rademacher complexity).
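The scalability point above can be illustrated with a minimal sketch of stochastic gradient descent on a squared-error ERM objective. The one-dimensional linear model, the noise-free toy data, the learning rate and the step count are all assumptions chosen for this example, not a recommended configuration.

```python
import random


def sgd_linear_regression(xs, ys, lr=0.1, steps=5000, seed=0):
    """Fit y = w * x + b by SGD on the squared error, one sample per step."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    w, b = 0.0, 0.0
    for _ in range(steps):
        i = rng.randrange(len(xs))           # sample a single data point
        residual = (w * xs[i] + b) - ys[i]   # prediction error on that point
        # Gradient of residual**2 with respect to w and b.
        w -= lr * 2.0 * residual * xs[i]
        b -= lr * 2.0 * residual
    return w, b


# Hypothetical noise-free data generated from y = 2x + 1 on [-1, 1].
xs = [i / 25 - 1.0 for i in range(51)]
ys = [2.0 * x + 1.0 for x in xs]

w, b = sgd_linear_regression(xs, ys)
print(f"w ~ {w:.4f}, b ~ {b:.4f}")
```

Each update touches one data point, so the cost per step does not depend on the size of the data set. With noisy data, a constant learning rate would only get close to the minimizer rather than converge to it.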

Similarities

While ERM is not inherently probabilistic, it can be connected to Bayesian inference and MLE in the following ways:

If the loss function $\mathcal{L}(x_i, \theta)$ is chosen to be the negative log-likelihood $-\log p(x_i | \theta)$, then ERM reduces to MLE.
If the loss function is the negative log-likelihood and the regularization term is the negative log of a prior distribution (e.g., $\text{L}_2$ regularization corresponds to a zero-mean Gaussian prior), then regularized ERM becomes equivalent to Bayesian MAP estimation.

Thus, ERM can be seen as a generalization of MLE, where the loss function is not restricted to probabilistic losses.

Maximum Likelihood Estimation

Maximum Likelihood Estimation (MLE) is a frequentist method for estimating the parameters of a statistical model. Unlike Bayesian inference, MLE treats the parameters $\theta$ as fixed but unknown quantities and seeks the values of $\theta$ that make the observed data most probable. MLE is a special case of both ERM and Bayesian inference, making it a bridge between the two frameworks.

The goal of MLE is to find the parameters $\theta$ that maximize the likelihood function $p(X | \theta)$, which measures how probable the observed data $X$ is under the model parameterized by $\theta$. For independent and identically distributed (i.i.d.) data, the likelihood is the product of the probabilities (or densities) of the individual data points.

$$L(\theta | X) = p(X | \theta) = \prod_{i=1}^n p(x_i | \theta)$$

In practice, it is more convenient to work with the log-likelihood, which transforms the product into a sum and is numerically more stable:

$$\ell(\theta | X) = \log L(\theta | X) = \sum_{i=1}^n \log p(x_i | \theta)$$

The MLE estimate is then obtained by maximizing the log-likelihood:

$$\hat{\theta}_{\text{MLE}} = \arg\max_{\theta} \ell(\theta | X)$$
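The numerical-stability argument for the log-likelihood is easy to reproduce. The sketch below assumes a Gaussian model with known parameters and a synthetic data set of 2000 points; both are illustrative choices, as are the function names.

```python
import math


def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def likelihood(data, mu, sigma):
    """Raw likelihood: the product of the per-point densities."""
    return math.prod(normal_pdf(x, mu, sigma) for x in data)


def log_likelihood(data, mu, sigma):
    """Log-likelihood: the sum of the per-point log-densities."""
    return sum(
        -0.5 * math.log(2.0 * math.pi * sigma ** 2)
        - (x - mu) ** 2 / (2.0 * sigma ** 2)
        for x in data
    )


# Hypothetical data: 2000 points cycling through -1.5, -1.0, ..., 1.5.
data = [((i % 7) - 3) * 0.5 for i in range(2000)]

raw = likelihood(data, 0.0, 1.0)
ll = log_likelihood(data, 0.0, 1.0)
print(raw, ll)
```

Every density value is below 0.4, so the product of 2000 of them underflows to zero in double precision, while the sum of logarithms remains an ordinary finite number that can still be compared across parameter values.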

Examples

MLE is used extensively in classical statistics and machine learning.

Gaussian distribution: For data $X = \{x_1, \dots, x_n\}$ assumed to be drawn from a normal distribution $\mathcal{N}(\mu, \sigma^2)$, the MLE estimates for $\mu$ and $\sigma^2$ are the sample mean and the (biased, $1/n$) sample variance:

$$\hat{\mu}_{\text{MLE}} = \frac{1}{n} \sum_{i=1}^n x_i, \quad \hat{\sigma}^2_{\text{MLE}} = \frac{1}{n} \sum_{i=1}^n (x_i - \hat{\mu}_{\text{MLE}})^2$$

Bernoulli distribution: For binary data $X = \{x_1, \dots, x_n\}$ where $x_i \in \{0, 1\}$, the MLE estimate for the success probability $p$ is:

$$\hat{p}_{\text{MLE}} = \frac{1}{n} \sum_{i=1}^n x_i$$

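Linear regression: For a linear model $y_i = \theta^T x_i + \epsilon_i$ with i.i.d. noise $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$, the MLE estimate for $\theta$ is the ordinary least squares (OLS) solution, where $X$ now denotes the design matrix whose rows are the $x_i^T$, $y$ is the vector of targets, and $X^T X$ is assumed to be invertible: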

$$\hat{\theta}_{\text{MLE}} = (X^T X)^{-1} X^T y$$
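The three closed-form estimates above can be checked with the Python standard library. This is a sketch on made-up numbers; the function names are illustrative, and the regression case is restricted to a single feature plus an intercept, for which the OLS solution reduces to the familiar covariance-over-variance formula.

```python
import statistics


def gaussian_mle(xs):
    """MLE of mean and variance; pvariance divides by n, not n - 1."""
    mu = statistics.fmean(xs)
    var = statistics.pvariance(xs, mu)
    return mu, var


def bernoulli_mle(xs):
    """MLE of the success probability: the fraction of ones."""
    return sum(xs) / len(xs)


def ols_line(xs, ys):
    """OLS slope and intercept for y = w * x + b (one feature plus intercept)."""
    x_bar, y_bar = statistics.fmean(xs), statistics.fmean(ys)
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    w = s_xy / s_xx
    return w, y_bar - w * x_bar


mu_hat, var_hat = gaussian_mle([1.0, 2.0, 3.0, 4.0])
p_hat = bernoulli_mle([1, 0, 1, 1])
w_hat, b_hat = ols_line([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])

print(mu_hat, var_hat, p_hat, w_hat, b_hat)
```

`statistics.variance` would return the unbiased $1/(n-1)$ estimate instead, which is not the maximum likelihood estimate.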

Similarities

MLE is deeply connected to both ERM and Bayesian inference:

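MLE as ERM: If we define the loss function in ERM to be the negative log-likelihood, $\mathcal{L}(x_i, \theta) = -\log p(x_i | \theta)$, then MLE is equivalent to ERM, since maximizing the log-likelihood is the same as minimizing its negative and scaling by $1/n$ does not change the minimizer:

$$\hat{\theta}_{\text{MLE}} = \arg\max_{\theta} \sum_{i=1}^n \log p(x_i | \theta) = \arg\min_{\theta} \left[ -\frac{1}{n} \sum_{i=1}^n \log p(x_i | \theta) \right]$$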

This shows that MLE is a special case of ERM where the loss function is derived from a probabilistic model.
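A quick numerical check of this equivalence, under the assumption of a Gaussian model with known standard deviation: the average negative log-likelihood and the average squared error differ only by a positive scale factor and an additive constant, so a grid search over the mean finds the same minimizer for both. The data, the grid and the function names are illustrative.

```python
import math


def mean_nll(data, mu, sigma=1.0):
    """Average Gaussian negative log-likelihood for a candidate mean mu."""
    return sum(
        0.5 * math.log(2.0 * math.pi * sigma ** 2)
        + (x - mu) ** 2 / (2.0 * sigma ** 2)
        for x in data
    ) / len(data)


def mean_squared_error(data, mu):
    """Average squared error for a candidate mean mu."""
    return sum((x - mu) ** 2 for x in data) / len(data)


data = [2.1, 1.9, 2.4, 2.0, 1.6]           # hypothetical observations
grid = [i / 100 for i in range(0, 401)]    # candidate means from 0.00 to 4.00

mu_from_nll = min(grid, key=lambda m: mean_nll(data, m))
mu_from_mse = min(grid, key=lambda m: mean_squared_error(data, m))

print(mu_from_nll, mu_from_mse)
```

Both searches land on the sample mean, which is also the closed-form Gaussian MLE for $\mu$.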

MLE as MAP with Uniform Prior: In Bayesian inference, if we assume a uniform prior $p(\theta) \propto 1$, the MAP estimate reduces to the MLE estimate:

$$\hat{\theta}_{\text{MAP}} = \arg\max_{\theta} p(X | \theta) \cdot p(\theta) = \arg\max_{\theta} p(X | \theta) = \hat{\theta}_{\text{MLE}}$$

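Thus, MLE can be seen as MAP estimation under a flat (non-informative) prior. On an unbounded parameter space such a uniform prior is improper, meaning it does not integrate to one, but the argument still holds because only the location of the maximum matters.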
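The same limit can be observed with a proper prior that becomes increasingly flat. The sketch below assumes Gaussian data with known standard deviation `sigma` and a zero-mean Gaussian prior with standard deviation `tau` on the mean; for this model, setting the derivative of the log-posterior to zero gives the closed form used in the code. The names and numbers are illustrative assumptions.

```python
def gaussian_mean_map(data, sigma, tau):
    """MAP estimate of a Gaussian mean under a N(0, tau^2) prior.

    Solves sum(x_i - mu) / sigma^2 - mu / tau^2 = 0 for mu.
    """
    n = len(data)
    return sum(data) / (n + sigma ** 2 / tau ** 2)


data = [2.1, 1.9, 2.4, 2.0, 1.6]      # hypothetical observations
mle = sum(data) / len(data)           # sample mean, the Gaussian MLE

map_tight_prior = gaussian_mean_map(data, sigma=1.0, tau=1.0)
map_flat_prior = gaussian_mean_map(data, sigma=1.0, tau=1e6)

print(mle, map_tight_prior, map_flat_prior)
```

With `tau = 1` the estimate is shrunk towards the prior mean of zero. As `tau` grows, the prior flattens and the MAP estimate approaches the MLE.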

Summary

In this post, we explored the three foundational frameworks of statistical learning: Bayesian inference, Empirical Risk Minimization (ERM), and Maximum Likelihood Estimation (MLE).

Bayesian Inference is the most general framework, as it treats parameters as random variables and provides a full posterior distribution. It subsumes both MLE (as MAP with a uniform prior) and regularized ERM (when the loss is a negative log-likelihood and the penalty is a negative log-prior).
Empirical Risk Minimization is the workhorse of modern machine learning, offering flexibility and scalability for a wide range of loss functions. MLE is a special case of ERM where the loss is the negative log-likelihood.
Maximum Likelihood Estimation is a frequentist method that provides point estimates by maximizing the likelihood of the observed data. It is equivalent to MAP with a uniform prior and to ERM with a negative log-likelihood loss.