=== Relation to Bayesian inference ===
A maximum likelihood estimator coincides with the [[Maximum a posteriori|most probable]] [[Bayesian estimator]] given a [[Uniform distribution (continuous)|uniform]] [[prior probability|prior distribution]] on the [[parameter space|parameters]]. Indeed, the [[maximum a posteriori estimate]] is the parameter {{mvar|θ}} that maximizes the probability of {{mvar|θ}} given the data, given by Bayes' theorem:

<math display="block">
    \operatorname{\mathbb P}(\theta\mid x_1,x_2,\ldots,x_n) = \frac{f(x_1,x_2,\ldots,x_n\mid\theta)\operatorname{\mathbb P}(\theta)}{\operatorname{\mathbb P}(x_1,x_2,\ldots,x_n)}
  </math>

where <math>\operatorname{\mathbb P}(\theta)</math> is the prior distribution for the parameter {{mvar|θ}} and where <math>\operatorname{\mathbb P}(x_1,x_2,\ldots,x_n)</math> is the probability of the data averaged over all parameters. Since the denominator is independent of {{mvar|θ}}, the Bayesian estimator is obtained by maximizing <math>f(x_1,x_2,\ldots,x_n\mid\theta)\operatorname{\mathbb P}(\theta)</math> with respect to {{mvar|θ}}. If we further assume that the prior <math>\operatorname{\mathbb P}(\theta)</math> is a uniform distribution, the Bayesian estimator is obtained by maximizing the likelihood function <math>f(x_1,x_2,\ldots,x_n\mid\theta)</math>. Thus the Bayesian estimator coincides with the maximum likelihood estimator for a uniform prior distribution <math>\operatorname{\mathbb P}(\theta)</math>.
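The equivalence can be checked numerically. The following is a minimal sketch, not a reference implementation: it assumes a Bernoulli model with made-up 0/1 data and a simple grid search over {{mvar|θ}}, and the names <code>log_likelihood</code>, <code>grid</code> and the Beta(2, 2) comparison prior are illustrative choices that do not come from the text above.

```python
import numpy as np

def log_likelihood(theta, data):
    """Bernoulli log-likelihood of 0/1 observations, evaluated for each theta."""
    k = data.sum()          # number of successes
    n = data.size           # number of trials
    return k * np.log(theta) + (n - k) * np.log1p(-theta)

# Hypothetical data: 7 successes in 10 trials.
data = np.array([1, 0, 1, 1, 0, 1, 1, 1, 0, 1])

# Candidate parameter values, kept away from 0 and 1 so the logs are finite.
grid = np.linspace(0.001, 0.999, 999)

log_lik = log_likelihood(grid, data)

# Uniform prior on (0, 1): constant density, so its log is constant (here 0).
log_uniform_prior = np.zeros_like(grid)

# Non-uniform comparison prior: Beta(2, 2), up to an additive constant.
log_beta_prior = np.log(grid) + np.log1p(-grid)

# The denominator of Bayes' theorem does not depend on theta, so maximizing
# log-likelihood + log-prior is enough to find the posterior mode.
theta_mle = grid[np.argmax(log_lik)]
theta_map_uniform = grid[np.argmax(log_lik + log_uniform_prior)]
theta_map_beta = grid[np.argmax(log_lik + log_beta_prior)]

print(theta_mle, theta_map_uniform, theta_map_beta)
```

Under these assumptions the uniform-prior maximizer is the same grid point as the maximum likelihood estimate, while the Beta(2, 2) prior pulls the maximizer toward 1/2.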

==== Application of maximum-likelihood estimation in Bayes decision theory ====
In many practical applications in [[machine learning]], maximum-likelihood estimation is used as the method for parameter estimation.

Bayesian decision theory is concerned with designing a classifier that minimizes the total expected risk. In particular, when the costs (the loss function) associated with different decisions are equal, the classifier minimizes the error over the whole distribution.<ref>{{cite web |last=Christensen |first=Henrik I. |title=Pattern Recognition |publisher=Georgia Tech |series=Bayesian Decision Theory - CS 7616 |url=https://www.cc.gatech.edu/~hic/CS7616/pdf/lecture2.pdf |type=lecture}}</ref>

Thus, the Bayes Decision Rule is stated as
:"decide <math>\;w_1\;</math> if <math>~\operatorname{\mathbb P}(w_1|x) \; > \; \operatorname{\mathbb P}(w_2|x)~;~</math> otherwise decide <math>\;w_2\;</math>"
where <math>\;w_1\,, w_2\;</math> are predictions of different classes. From a perspective of minimizing error, it can also be stated as
<math display="block">w = \underset{ w }{\operatorname{arg\;min}} \; \int_{-\infty}^\infty \operatorname{\mathbb P}(\text{ error}\mid x)\operatorname{\mathbb P}(x)\,\operatorname{d}x~</math>
where
<math display="block">\operatorname{\mathbb P}(\text{ error}\mid x) = \operatorname{\mathbb P}(w_1\mid x)~</math>
if we decide <math>\;w_2\;</math> and <math>\;\operatorname{\mathbb P}(\text{ error}\mid x) = \operatorname{\mathbb P}(w_2\mid x)\;</math> if we decide <math>\;w_1\;.</math>

By applying [[Bayes' theorem]]
<math display="block">\operatorname{\mathbb P}(w_i \mid x) = \frac{\operatorname{\mathbb P}(x \mid w_i) \operatorname{\mathbb P}(w_i)}{\operatorname{\mathbb P}(x)}\;,</math>
and further assuming the zero-or-one loss function, which assigns the same loss to every error, the Bayes decision rule can be reformulated as:
<math display="block">h_\text{Bayes} = \underset{ w }{\operatorname{arg\;max}} \, \bigl[\, \operatorname{\mathbb P}(x\mid w)\,\operatorname{\mathbb P}(w) \,\bigr]\;,</math>
where <math>h_\text{Bayes}</math> is the prediction and <math>\;\operatorname{\mathbb P}(w)\;</math> is the [[prior probability]].
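As an illustration, the reformulated rule can be sketched for two classes with one-dimensional Gaussian class-conditional densities. Everything specific here is an assumption made for the example and is not taken from the cited source: the class names, means, standard deviations and priors, as well as the helper names <code>gaussian_pdf</code> and <code>bayes_decision</code>.

```python
import numpy as np

def gaussian_pdf(x, mean, std):
    """Density of a normal distribution N(mean, std^2) at x."""
    z = (x - mean) / std
    return np.exp(-0.5 * z**2) / (std * np.sqrt(2.0 * np.pi))

# Hypothetical class models: P(x | w) is Gaussian, P(w) is the class prior.
EQUAL_PRIORS = {
    "w1": {"prior": 0.5, "mean": 0.0, "std": 1.0},
    "w2": {"prior": 0.5, "mean": 2.0, "std": 1.0},
}

SKEWED_PRIORS = {
    "w1": {"prior": 0.9, "mean": 0.0, "std": 1.0},
    "w2": {"prior": 0.1, "mean": 2.0, "std": 1.0},
}

def bayes_decision(x, classes):
    """Return the class w that maximizes P(x | w) * P(w)."""
    scores = {
        name: gaussian_pdf(x, c["mean"], c["std"]) * c["prior"]
        for name, c in classes.items()
    }
    return max(scores, key=scores.get)

# With equal priors and equal variances the boundary lies midway between the means.
print(bayes_decision(0.5, EQUAL_PRIORS))
print(bayes_decision(1.5, EQUAL_PRIORS))

# A larger prior for w1 shifts the boundary toward the mean of w2.
print(bayes_decision(1.5, SKEWED_PRIORS))
```

The evidence <math>\operatorname{\mathbb P}(x)</math> is omitted from the scores because it is the same for every class and therefore does not change which class attains the maximum.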