== Probabilistic model ==
Abstractly, naive Bayes is a [[conditional probability]] model: it assigns probabilities <math>p(C_k \mid x_1, \ldots, x_n)</math> for each of the {{mvar|K}} possible outcomes or ''classes'' <math>C_k</math> given a problem instance to be classified, represented by a vector <math>\mathbf{x} = (x_1, \ldots, x_n)</math> encoding some {{mvar|n}} features (independent variables).<ref>{{cite book | last1 = Narasimha Murty | first1 = M. | last2 = Susheela Devi | first2 = V. | title = Pattern Recognition: An Algorithmic Approach | year = 2011 | publisher = Springer | isbn = 978-0857294944 }}</ref>

The problem with the above formulation is that if the number of features {{mvar|n}} is large or if a feature can take on a large number of values, then basing such a model on [[Conditional probability table|probability tables]] is infeasible. The model must therefore be reformulated to make it more tractable. Using [[Bayes' theorem]], the conditional probability can be decomposed as:
<math display="block">p(C_k \mid \mathbf{x}) = \frac{p(C_k) \ p(\mathbf{x} \mid C_k)}{p(\mathbf{x})} \,</math>

In plain English, using [[Bayesian probability]] terminology, the above equation can be written as
<math display="block">\text{posterior} = \frac{\text{prior} \times \text{likelihood}}{\text{evidence}} \,</math>

In practice, there is interest only in the numerator of that fraction, because the denominator does not depend on <math>C</math> and the values of the features <math>x_i</math> are given, so that the denominator is effectively constant. The numerator is equivalent to the [[joint probability]] model
<math display="block">p(C_k, x_1, \ldots, x_n)\,</math>
which can be rewritten as follows, using the [[Chain rule (probability)|chain rule]] for repeated applications of the definition of [[conditional probability]]:
<math display="block">\begin{align}
p(C_k, x_1, \ldots, x_n) & = p(x_1, \ldots, x_n, C_k) \\
& = p(x_1 \mid x_2, \ldots, x_n, C_k) \ p(x_2, \ldots, x_n, C_k) \\
& = p(x_1 \mid x_2, \ldots, x_n, C_k) \ p(x_2 \mid x_3, \ldots, x_n, C_k) \ p(x_3, \ldots, x_n, C_k) \\
& = \cdots \\
& = p(x_1 \mid x_2, \ldots, x_n, C_k) \ p(x_2 \mid x_3, \ldots, x_n, C_k) \cdots p(x_{n-1} \mid x_n, C_k) \ p(x_n \mid C_k) \ p(C_k) \\
\end{align}</math>

Now the "naive" [[conditional independence]] assumptions come into play: assume that all features in <math>\mathbf{x}</math> are [[mutually independent]], conditional on the category <math>C_k</math>. Under this assumption,
<math display="block">p(x_i \mid x_{i+1}, \ldots, x_{n}, C_k) = p(x_i \mid C_k)\,.</math>

Thus, the joint model can be expressed as
<math display="block">\begin{align}
p(C_k \mid x_1, \ldots, x_n) \varpropto\ & p(C_k, x_1, \ldots, x_n) \\
& = p(C_k) \ p(x_1 \mid C_k) \ p(x_2 \mid C_k) \ p(x_3 \mid C_k) \ \cdots \\
& = p(C_k) \prod_{i=1}^n p(x_i \mid C_k)\,,
\end{align}</math>
where <math>\varpropto</math> denotes [[Proportionality (mathematics)|proportionality]], since the denominator <math>p(\mathbf{x})</math> is omitted.

This means that under the above independence assumptions, the conditional distribution over the class variable <math>C</math> is:
<math display="block">p(C_k \mid x_1, \ldots, x_n) = \frac{1}{Z} \ p(C_k) \prod_{i=1}^n p(x_i \mid C_k)</math>
where the evidence <math>Z = p(\mathbf{x}) = \sum_k p(C_k) \ p(\mathbf{x} \mid C_k)</math> is a scaling factor dependent only on <math>x_1, \ldots, x_n</math>, that is, a constant if the values of the feature variables are known.
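The factored posterior above translates directly into a short computation. The following is a minimal illustrative sketch in Python; the class priors and per-feature conditional probabilities are hypothetical values for two classes and three binary features, assumed to have been estimated from training data beforehand. It forms the unnormalised product <math>p(C_k) \prod_i p(x_i \mid C_k)</math> and divides by the evidence <math>Z</math>.

<syntaxhighlight lang="python">
import numpy as np

# Hypothetical parameters (illustration only): two classes, three binary features.
priors = np.array([0.6, 0.4])            # p(C_0), p(C_1)
cond = np.array([[0.2, 0.7, 0.9],        # p(x_i = 1 | C_0)
                 [0.8, 0.3, 0.4]])       # p(x_i = 1 | C_1)

def posterior(x):
    """Return p(C_k | x) for a binary feature vector x, using the
    factorisation p(C_k) * prod_i p(x_i | C_k), normalised by Z."""
    x = np.asarray(x)
    likelihood = np.prod(np.where(x == 1, cond, 1.0 - cond), axis=1)
    unnormalised = priors * likelihood    # numerator p(C_k, x_1, ..., x_n)
    Z = unnormalised.sum()                # evidence p(x)
    return unnormalised / Z

print(posterior([1, 0, 1]))               # approx. [0.266, 0.734]; sums to 1
</syntaxhighlight>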
=== Constructing a classifier from the probability model ===
The discussion so far has derived the independent feature model, that is, the naive Bayes [[probability model]]. The naive Bayes [[Statistical classification|classifier]] combines this model with a [[decision rule]]. One common rule is to pick the hypothesis that is most probable, so as to minimize the probability of misclassification; this is known as the ''[[maximum a posteriori]]'' or ''MAP'' decision rule. The corresponding classifier, a [[Bayes classifier]], is the function that assigns a class label <math>\hat{y} = C_k</math> for some {{mvar|k}} as follows:
<math display="block">\hat{y} = \underset{k \in \{1, \ldots, K\}}{\operatorname{argmax}} \ p(C_k) \displaystyle\prod_{i=1}^n p(x_i \mid C_k).</math>

[[File:ROC_curves.svg|thumb|[[Likelihood function]]s <math>p(\mathbf{x} \mid Y)</math>, [[confusion matrix]] and [[ROC curve]]. For the naive Bayes classifier, given that the a priori probabilities <math>p(Y)</math> are the same for all classes, the [[decision boundary]] (green line) is placed at the point where the two probability densities intersect, since {{nowrap|<math>p(Y \mid \mathbf{x}) = \frac{p(Y) \ p(\mathbf{x} \mid Y)}{p(\mathbf{x})} \propto p(\mathbf{x} \mid Y)</math>.}}]]
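The MAP rule itself is an argmax over the per-class scores <math>p(C_k) \prod_i p(x_i \mid C_k)</math>. A minimal sketch follows, reusing the same hypothetical parameters as the example above; logarithms are used only as the usual numerical-stability measure (so the product of many small factors does not underflow), not as part of the definition.

<syntaxhighlight lang="python">
import numpy as np

# Same hypothetical parameters as above (illustration only).
priors = np.array([0.6, 0.4])
cond = np.array([[0.2, 0.7, 0.9],
                 [0.8, 0.3, 0.4]])

def classify(x):
    """Return argmax_k of log p(C_k) + sum_i log p(x_i | C_k)."""
    x = np.asarray(x)
    log_joint = (np.log(priors)
                 + np.sum(np.log(np.where(x == 1, cond, 1.0 - cond)), axis=1))
    return int(np.argmax(log_joint))      # index k of the MAP class

print(classify([1, 0, 1]))                 # 1 for these illustrative parameters
</syntaxhighlight>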