=== Continuous distribution, continuous parameter space ===
For the [[normal distribution]] <math>\mathcal{N}(\mu, \sigma^2)</math>, which has [[probability density function]]

<math display="block">f(x\mid \mu,\sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}\ }
                               \exp\left(-\frac {(x-\mu)^2}{2\sigma^2} \right), </math>

the corresponding [[probability density function]] for a sample of {{mvar|n}} [[independent identically distributed]] normal random variables (the likelihood) is

<math display="block">f(x_1,\ldots,x_n \mid \mu,\sigma^2) = \prod_{i=1}^n f( x_i\mid  \mu, \sigma^2) = \left( \frac{1}{2\pi\sigma^2} \right)^{n/2} \exp\left( -\frac{ \sum_{i=1}^n (x_i-\mu)^2}{2\sigma^2}\right).</math>

This family of distributions has two parameters: {{math|''θ''&nbsp;{{=}} (''μ'',&nbsp;''σ''<sup>2</sup>)}}; so we maximize the likelihood, <math>\mathcal{L} (\mu,\sigma^2) = f(x_1,\ldots,x_n \mid \mu, \sigma^2)</math>, over both parameters simultaneously or, if possible, individually.

Since the [[natural logarithm|logarithm]] function itself is a [[continuous function|continuous]] [[strictly increasing]] function over the [[range (statistics)|range]] of the likelihood, the values which maximize the likelihood will also maximize its logarithm (the log-likelihood itself is not necessarily strictly increasing). The log-likelihood can be written as follows:

<math display="block">
   \log\Bigl( \mathcal{L} (\mu,\sigma^2)\Bigr) = -\frac{\,n\,}{2} \log(2\pi\sigma^2)
   - \frac{1}{2\sigma^2} \sum_{i=1}^n (\,x_i-\mu\,)^2
</math>
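
As a minimal numerical sketch of this formula, the closed form can be compared with the logarithm of the product of the individual densities. The function names, the sample and the parameter values below are illustrative assumptions, not part of the derivation.

```python
import math

def normal_pdf(x, mu, sigma2):
    """Density of N(mu, sigma2) evaluated at x."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

def normal_log_likelihood(xs, mu, sigma2):
    """Closed-form log-likelihood of an i.i.d. normal sample."""
    n = len(xs)
    sum_sq = sum((x - mu) ** 2 for x in xs)
    return -0.5 * n * math.log(2 * math.pi * sigma2) - sum_sq / (2 * sigma2)

sample = [2.1, 3.4, 1.9, 2.8, 3.0]   # illustrative data
mu, sigma2 = 2.5, 0.5                # illustrative parameter values

# Sum of log-densities (log of the product) versus the closed form.
direct = sum(math.log(normal_pdf(x, mu, sigma2)) for x in sample)
closed = normal_log_likelihood(sample, mu, sigma2)
print(direct, closed)  # the two values agree up to rounding error
```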

(Note: the log-likelihood is closely related to [[information entropy]] and [[Fisher information]].)

We now compute the derivatives of this log-likelihood as follows.

<math display="block">\begin{align}
0 & = \frac{\partial}{\partial \mu} \log\Bigl( \mathcal{L} (\mu,\sigma^2)\Bigr) =
 0 - \frac{\;-2 n(\bar{x}-\mu)\;}{2\sigma^2},
\end{align}</math>
where <math> \bar{x} </math> is the [[sample mean]]. This is solved by

<math display="block">\widehat\mu = \bar{x} = \sum^n_{i=1} \frac{\,x_i\,}{n}. </math>

This is indeed the maximum of the function, since it is the only turning point in {{mvar|μ}} and the second derivative is strictly less than zero. Its [[expected value]] is equal to the parameter {{mvar|μ}} of the given distribution,

<math display="block">\operatorname{\mathbb E}\bigl[\;\widehat\mu\;\bigr] = \mu, \, </math>

which means that the maximum likelihood estimator <math>\widehat\mu</math> is unbiased.

Similarly we differentiate the log-likelihood with respect to {{mvar|σ}} and equate to zero:

<math display="block">\begin{align}
0 & = \frac{\partial}{\partial \sigma} \log\Bigl( \mathcal{L} (\mu,\sigma^2)\Bigr) = -\frac{\,n\,}{\sigma} 
   + \frac{1}{\sigma^3} \sum_{i=1}^{n} (\,x_i-\mu\,)^2,
\end{align}</math>

which is solved by

<math display="block">\widehat\sigma^2 = \frac{1}{n} \sum_{i=1}^n(x_i-\mu)^2.</math>

Inserting the estimate <math>\mu = \widehat\mu</math> we obtain

<math display="block">\widehat\sigma^2 = \frac{1}{n} \sum_{i=1}^n (x_i - \bar{x})^2 = \frac{1}{n}\sum_{i=1}^n x_i^2 -\frac{1}{n^2}\sum_{i=1}^n\sum_{j=1}^n x_i x_j.</math>
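
The two closed-form estimates can be computed directly from a sample. The following is a minimal sketch; the function name <code>normal_mle</code> and the data are assumptions chosen so that the arithmetic is easy to follow by hand.

```python
def normal_mle(xs):
    """Return (mu_hat, sigma2_hat) for an i.i.d. normal sample.

    mu_hat is the sample mean; sigma2_hat divides by n (not n - 1).
    """
    n = len(xs)
    mu_hat = sum(xs) / n
    sigma2_hat = sum((x - mu_hat) ** 2 for x in xs) / n
    return mu_hat, sigma2_hat

sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # illustrative data
mu_hat, sigma2_hat = normal_mle(sample)

# Second form of the estimate: mean of squares minus the double sum over n^2.
n = len(sample)
alt = sum(x * x for x in sample) / n - sum(xi * xj for xi in sample for xj in sample) / n ** 2

print(mu_hat, sigma2_hat, alt)  # 5.0 4.0 4.0
```

Python's standard-library <code>statistics.pvariance</code> uses the same 1/''n'' divisor, so it gives the same value as <code>sigma2_hat</code> here.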

To calculate its expected value, it is convenient to rewrite the expression in terms of zero-mean random variables ([[statistical error]]s) <math>\delta_i \equiv \mu - x_i</math>. Expressing the estimate in these variables yields

<math display="block">\widehat\sigma^2 = \frac{1}{n} \sum_{i=1}^n (\mu - \delta_i)^2 -\frac{1}{n^2}\sum_{i=1}^n\sum_{j=1}^n (\mu - \delta_i)(\mu - \delta_j).</math>

Simplifying the expression above, using the facts that <math>\operatorname{\mathbb E}\bigl[\;\delta_i\;\bigr] = 0 </math> and <math>\operatorname{\mathbb E}\bigl[\;\delta_i^2\;\bigr] = \sigma^2 </math>, allows us to obtain

<math display="block">\operatorname{\mathbb E}\bigl[\;\widehat\sigma^2\;\bigr]= \frac{\,n-1\,}{n}\sigma^2.</math>
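
The intermediate steps can be sketched as follows. Expanding the squares and products, the terms containing {{mvar|μ}} cancel, leaving

<math display="block">\widehat\sigma^2 = \frac{1}{n} \sum_{i=1}^n \delta_i^2 -\frac{1}{n^2}\sum_{i=1}^n\sum_{j=1}^n \delta_i \delta_j.</math>

Because the <math>\delta_i</math> are independent with zero mean, <math>\operatorname{\mathbb E}\bigl[\;\delta_i \delta_j\;\bigr] = 0</math> for <math>i \neq j</math>, so only the {{mvar|n}} diagonal terms of the double sum contribute:

<math display="block">\operatorname{\mathbb E}\bigl[\;\widehat\sigma^2\;\bigr] = \frac{1}{n}\, n\sigma^2 - \frac{1}{n^2}\, n\sigma^2 = \sigma^2 - \frac{\sigma^2}{n},</math>

which is the value <math>\tfrac{n-1}{n}\sigma^2</math> stated above.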

This means that the estimator <math>\widehat\sigma^2</math> is biased for <math>\sigma^2</math>. It can also be shown that <math>\widehat\sigma</math> is biased for <math>\sigma</math>, but that both <math>\widehat\sigma^2</math> and <math>\widehat\sigma</math> are consistent.
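
The bias of the variance estimate can also be seen in a small simulation. The following is a minimal sketch using only Python's standard library; the function names, sample size, number of trials, seed and parameter values are illustrative assumptions.

```python
import random

def sigma2_mle(xs):
    """Maximum likelihood estimate of the variance (divides by n)."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / n

def average_sigma2_mle(n, mu, sigma, trials, seed=0):
    """Average the variance MLE over many simulated normal samples of size n."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    total = 0.0
    for _ in range(trials):
        xs = [rng.gauss(mu, sigma) for _ in range(n)]
        total += sigma2_mle(xs)
    return total / trials

n, mu, sigma = 5, 10.0, 2.0          # illustrative values; true variance is 4.0
estimate = average_sigma2_mle(n, mu, sigma, trials=20000)
expected = (n - 1) / n * sigma ** 2  # theoretical mean of the estimator

print(estimate, expected)  # the simulated average lies close to `expected`, below sigma**2
```

Increasing <code>n</code> moves the factor (''n''&nbsp;−&nbsp;1)/''n'' toward 1, which illustrates why the estimator is biased in small samples yet consistent.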

Formally we say that the ''maximum likelihood estimator'' for <math>\theta=(\mu,\sigma^2)</math> is

<math display="block">\widehat{\theta\,} = \left(\widehat{\mu},\widehat{\sigma}^2\right).</math>

In this case the MLEs could be obtained individually.  In general this may not be the case, and the MLEs would have to be obtained simultaneously.

The normal log-likelihood at its maximum takes a particularly simple form:

<math display="block">
   \log\Bigl( \mathcal{L}(\widehat\mu,\widehat\sigma^2)\Bigr) = \frac{\,-n\;\;}{2} \bigl(\,\log(2\pi\widehat\sigma^2) +1\,\bigr)
</math>

This maximum log-likelihood can be shown to be the same for more general [[least squares]], even for [[non-linear least squares]].  This is often used in determining likelihood-based approximate [[confidence interval]]s and [[confidence region]]s, which are generally more accurate than those using the asymptotic normality discussed above.
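
The closed form for the maximized normal log-likelihood given above can be checked numerically. The sketch below evaluates the log-likelihood at the estimates, compares it with the closed form, and confirms that a few nearby parameter values give a smaller log-likelihood. The sample, the function name and the perturbation sizes are illustrative assumptions.

```python
import math

def log_likelihood(xs, mu, sigma2):
    """Log-likelihood of an i.i.d. normal sample at (mu, sigma2)."""
    n = len(xs)
    sum_sq = sum((x - mu) ** 2 for x in xs)
    return -0.5 * n * math.log(2 * math.pi * sigma2) - sum_sq / (2 * sigma2)

sample = [2.1, 3.4, 1.9, 2.8, 3.0, 2.2]  # illustrative data
n = len(sample)
mu_hat = sum(sample) / n
sigma2_hat = sum((x - mu_hat) ** 2 for x in sample) / n

at_max = log_likelihood(sample, mu_hat, sigma2_hat)
closed_form = -0.5 * n * (math.log(2 * math.pi * sigma2_hat) + 1)

# Log-likelihood at a small grid of perturbed parameters around the estimates.
nearby = [
    log_likelihood(sample, mu_hat + dm, sigma2_hat * ds)
    for dm in (-0.1, 0.0, 0.1)
    for ds in (0.9, 1.0, 1.1)
    if (dm, ds) != (0.0, 1.0)
]

print(at_max, closed_form)               # equal up to rounding error
print(all(v < at_max for v in nearby))   # True
```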