Maximum Likelihood vs Maximum A Posteriori estimate

11 minute read

Published:

When trying to model a probability distribution, we want to find the parameters such that the resulting distribution best describes the observed data.

TL;DR

If the parameters are \(\theta\), and the dataset is \(D\), then the Maximum Likelihood Estimate (MLE) is the value of \(\theta\) for which the observed data is the most likely (maximize \(P(D | \theta)\)), while Maximum A Posteriori (MAP) estimate is the value of \(\theta\) which is the most likely given the data (maximize \(P(\theta | D)\)).

An example

Let’s say we have a biased coin and we want to estimate the probability of Heads in a toss for that coin. To estimate the parameter \(\theta\) (in this case a scalar value which is the probability of Heads) we toss the coin 5 times and record the observations.

Suppose we got the following result – HHHTH. Since 4 out of the 5 tosses were Heads, it is reasonable to estimate that the coin has an 80% probability of getting Heads, i.e., \(\theta = 0.8\)

But we also know that if we had a perfectly fair coin, the chances of getting 4 out of 5 Heads, or even all 5 Heads are not insignificant, so we might be picking up a lot of noise in our estimate due to the small number of observations we have made.

One obvious solution is to get more data which will give us more confidence in our estimates. If we tossed the coin 500 times and got 400 Heads, we would be much more confident in our estimate of 0.8 probability for Heads.

When that is not possible, we might want some way to encode our prior knowledge or assumptions about the coin into the calculation for model parameters. For example, maybe for a previous experiment, we had already tossed the same coin 10 times and we got the following result – HTHHTTHTTT.

With 4 Heads and 6 Tails, the coin appears to behave more like a fair coin than a heavily biased one. If we also include these prior results in our current calculations, we have 8 Heads and 7 Tails out of a total of 15 tosses. With these results, our new estimate is \(\theta = 0.533\)

Maximum Likelihood Estimate

MLE or Maximum Likelihood Estimate is the value we get by maximizing the likelihood \(P(D | \theta)\). This means we are finding the parameters which give us the highest probability of observing the data given the parameters.

In the coin toss example, \(P(D | \theta)\) is the probability of observing HHHTH. Since under the probabilistic model that we have assumed the tosses are independent, \(P(D | \theta)\) is the probability of observing 4 Heads and 1 Tail in 5 tosses for a given value of theta.

\[P(D | \theta) = {5 \choose 4} \theta^4 (1-\theta)\]

When we maximize this term, we get:

\[\theta^\ast = \frac{4}{5} = 0.8\]

Maximum A Posteriori

MAP or Maximum A Posteriori probability estimate is the value we get by maximizing the posterior \(P(\theta | D)\) of the distribution for the probabilistic model.

Using Bayes’ rule, we can write

\[P(\theta | D) = \frac{P(D | \theta) P(\theta)}{P(D)}\]

Since \(P(D)\) does not change with \(\theta\), we can maximize \(P(\theta | D)\) by maximizing the numerator from the RHS: \(P(D | \theta)P(\theta)\)

\(P(D | \theta)\) is the term that we maximized in the MLE case, i.e., the likelihood. \(P(\theta)\) is the prior, which is the information or beliefs we have about the distribution prior to observing the data.

This effectively enables us to convert our prior beliefs into virtual data points in our dataset. In the coin toss example, let’s say our prior is a beta distribution of the following form:

\[P(\theta) = beta(\theta; 5,7) = \frac{\Gamma(12)}{\Gamma(5)\Gamma(7)} \theta^4 (1-\theta)^6\]

Then, maximizing the posterior, we get the MAP estimate of theta to be

\[\theta^\ast = 8/15 = 0.533\]

While there is no technical limitation on the prior distribution, and it can take the form of any valid probability distribution, we chose this particular prior because beta distribution is the conjugate prior distribution of our likelihood, which follows a binomial distribution. The prior \(beta(\alpha, \beta)\) translates to adding \(\alpha - 1\) virtual Heads and \(\beta - 1\) virtual Tails in our dataset, and thus makes the calculation of the posterior a lot easier.

For a binomial distribution likelihood \(B(n,k)\) and a prior of the form \(beta(\alpha, \beta)\), the MLE and MAP estimates are:

\[\theta^{\ast}_{MLE} = \frac{k}{n}\] \[\theta^{\ast}_{MAP} = \frac{k + \alpha - 1}{n + \alpha + \beta – 2}\]

This way, the prior allows us to encode the information that we had about the model before observing the data. Conversely,the new observations allow us to update those beliefs using the likelihood to give us the new posterior distribution.

In our case, high values of \(\alpha\) and \(\beta\) mean we have very strong prior beliefs about the distribution and thus a lot of new data would be required to change the posterior in a significant manner. A prior of \(beta(1, 1)\) (i.e., uniform distribution), on the other hand, means we have no prior information about the distribution and will accept the likelihood as our posterior i.e., MLE and MAP results will be the same.

How they are similar

From the Bayesian inference perspective, MLE is a special case of MAP where the prior is a uniform distribution i.e., where we have no prior information about the parameters and all valid values of the parameters are equally likely.

MLE and MAP are both point estimators, i.e., they only give a single fixed value for \(\theta\). So we know that the MLE/MAP estimate of \(\theta\) is \(\theta^\ast\), but we do not get any information about how much more the value of the likelihood/posterior is for \(\theta^{\ast}\) compared to any other value of \(\theta\).

On the other hand, Bayesian inference allows us to calculate the complete probability distribution of the parameters. For example, in the coin toss experiment mentioned above, observing 4 Heads out of 5 tosses and observing 400 Heads out of 500 tosses would both give us the MLE estimate of \(\theta^{\ast} = 0.8\) but full Bayesian inference will tell us that in the first case a large spread of values of \(theta\) has high probability density while in the second case it is extremely likely that the true value of \(\theta\) is very close to 0.8

While Bayesian inference allows us to get a lot more information about the parameters of a model, often MLE or MAP estimates are good enough for many use cases while being much easier to compute.

How they differ

The only difference between MLE and MAP is the prior. If you have any prior beliefs in the system then you would want to use that information in the form of a prior to calculate the MAP estimate.

If, on the other hand, there is no prior information available, or if the amount of new observations is so large that the prior is insignificant, then MLE can be used which will also be easier to calculate compared to the posterior.

Other estimates

There are other point estimators. The EAP or Expected A Posteriori estimate calculates the expected value (as opposed to the mode in the case of MAP) of the posterior distribution.

Linear Regression Example

Let us calculate the MLE and MAP parameters for a linear regression example. Suppose there is a dataset with \(n\) data points of \(p\) dimensional features \(X\) and real-valued target \(\pmb{y}\). We want to find a linear function \(f(\cdot)\) that best explains this data. This model is characterized by the weight vector \(\pmb{\beta}\).

In this dataset, the \(i^{th}\) data point is composed of the feature vector \(\pmb{x_i}\) and target \(y_i\).

We take the target to be

\[y_i = f(\pmb{x_i}; \pmb{\beta}) + \epsilon_i\] \[= \pmb{x_i}^T\pmb{\beta} + \epsilon_i\]

Where \(\epsilon\) is the noise term. Therefore, we are modeling the target as being a function of the features plus an additional noise term, which accounts for the randomness due to any terms that we haven’t accounted for as well as any intrinsic stochastic processes that might be happening.

Assumptions

  1. All the data points are independent and identically distributed (i.i.d.) samples taken from the underlying distribution.
  2. The vectors \(\pmb{x_i}\) are deterministic and not random variables.
  3. The noise \(\epsilon\) is Gaussian with zero mean and constant variance of \(\sigma^2\) for all the data points i.e., \( \epsilon_i \sim \mathcal{N}(0, \sigma^2) \)

Finding MLE parameters

Our goal is to find the parameters \(\pmb{\beta}\) that maximize the likelihood of observing the data given the parameters.

For regression task, we want to maximize \(P(\pmb{y} | X; \pmb{\beta})\), which is a product of \(P(y_i | \pmb{x_i}; \pmb{\beta})\) for all the data points in the dataset.

As we can see from the equation above, \(P(y_i | \pmb{x_i}; \pmb{\beta})\) is normally distributed with a mean of \(\pmb{x_i}^T\pmb{\beta}\) and variance \(\sigma^2\).

\[y_i \sim \mathcal{N}(\pmb{x_i}^T\pmb{\beta}, \sigma^2)\] \[P(y_i) = (2\pi\sigma^2)^{-\frac{1}{2}} exp(-\frac{(y_i - \pmb{x_i}^T\pmb{\beta})^2}{2\sigma^2})\]

Thus, the likelihood becomes:

\[L = \prod_{i=1}^{n}(2\pi\sigma^2)^{-\frac{1}{2}} exp(-\frac{(y_i - \pmb{x_i}^T\pmb{\beta})^2}{2\sigma^2})\]

Maximizing the above likelihood is the same as minimizing the negative log likelihood:

\[NLL = \sum_{i=1}^{n} (\frac{1}{2}\ln{(2\pi\sigma^2)} + \frac{(y_i - \pmb{x_i}^T\pmb{\beta})^2}{2\sigma^2})\]

Ignoring the constant term, we want to minimize

\[\sum_{i=1}^{n} (y_i - \pmb{x_i}^T\pmb{\beta})^2\]

That is, we want to minimize the sum of squared errors, which is a very common loss function in linear models and even some large neural networks.

MAP parameters

The MAP estimate is calculated by using the prior along with the likelihood term to get the posterior.

Let us consider the linear model

\[y_i = \pmb{x_i}^T\pmb{\beta} + \epsilon_i\]

We assume the prior distribution of the model weights to be i.i.d. Gaussians

\[\beta_k \sim \mathcal{N}(0, \frac{1}{2\lambda})\]

Thus, \( P(\beta_k) = \sqrt{\frac{\lambda}{\pi}}exp(-\lambda\beta_k^2) \)

The likelihood has the same form as in the MLE case. Thus, ignoring the constant multiplicative terms, the posterior that we want to maximize becomes:

\[exp(\sum_{i=1}^{n}-\frac{(y_i - \pmb{x_i}^T\pmb{\beta})^2}{2\sigma^2})\cdot\prod_{k=1}^{p}exp(-\lambda\beta_k^2)\]

Equivalently, we can minimize the negative log of the posterior:

\[\sum_{i=1}^{n}\frac{(y_i - \pmb{x_i}^T\pmb{\beta})^2}{2\sigma^2} + \sum_{k=1}^{p}\lambda\beta_k^2\]

Which is equivalent to maximizing

\[\sum_{i=1}^{n}(y_i - \pmb{x_i}^T\pmb{\beta})^2 + \sum_{k=1}^{p}\tilde{\lambda}\beta_k^2\]

We can see that calculating the MAP estimate for linear regression with Gaussian prior is the same as minimizing the sum of squares along with L2 regularization of the model weights \(\pmb{\beta}\) with the hyperparameter \(\tilde{\lambda} = 2\lambda\sigma^2\).

References

  1. IEOR 165 – Engineering Statistics, Quality Control, and Forecasting – Lect 6-8 (berkeley.edu)

  2. Maximum likelihood estimation - Wikipedia

  3. A Gentle Introduction to Maximum Likelihood Estimation and Maximum A Posteriori Estimation | by Shota Horii | Towards Data Science

  4. MLE, MAP and Bayesian Inference. Grasp the idea of Bayesian inference | by Shota Horii | Towards Data Science

  5. A Gentle Introduction to Linear Regression With Maximum Likelihood Estimation (machinelearningmastery.com)

Categories:

Leave a Comment