=== Multinomial naive Bayes ===
With a multinomial event model, samples (feature vectors) represent the frequencies with which certain events have been generated by a [[Multinomial distribution|multinomial]] <math>(p_1, \dots, p_n)</math> where <math>p_i</math> is the probability that event {{mvar|i}} occurs (or {{mvar|K}} such multinomials in the multiclass case). A feature vector <math>\mathbf{x} = (x_1, \dots, x_n)</math> is then a [[histogram]], with <math>x_i</math> counting the number of times event {{mvar|i}} was observed in a particular instance. This is the event model typically used for document classification, with events representing the occurrence of a word in a single document (see [[bag of words]] assumption).<ref>{{cite book |last1=James |first1=Gareth |last2=Witten |first2=Daniela |last3=Hastie |first3=Trevor |last4=Tibshirani |first4=Robert |title=An introduction to statistical learning: with applications in R |date=2021 |publisher=Springer |location=New York, NY |isbn=978-1-0716-1418-1 |page=157 |edition=Second |doi=10.1007/978-1-0716-1418-1 |url=https://link.springer.com/book/10.1007/978-1-0716-1418-1 |access-date=10 November 2024}}</ref> The likelihood of observing a histogram {{math|'''x'''}} is given by:
<math display="block">
p(\mathbf{x} \mid C_k) = \frac{(\sum_{i=1}^n x_i)!}{\prod_{i=1}^n x_i !} \prod_{i=1}^n {p_{ki}}^{x_i}
</math>
where <math>p_{ki} := p(i \mid C_k)</math>. 
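
As an illustration, the likelihood above can be evaluated for a small histogram. The following is a minimal sketch rather than a reference implementation: the function name <code>multinomial_log_likelihood</code> and the example values are hypothetical, and the computation is done with logarithms (via the log-gamma function) so that the factorials do not overflow.

```python
import math


def multinomial_log_likelihood(x, p):
    """Return log p(x | C_k) for a count vector x and event probabilities p.

    Hypothetical helper for illustration; p is assumed to sum to 1.
    """
    n_total = sum(x)
    # log of the multinomial coefficient (sum_i x_i)! / prod_i x_i!
    log_coef = math.lgamma(n_total + 1) - sum(math.lgamma(xi + 1) for xi in x)
    # log of prod_i p_i^{x_i}; events that were never observed contribute nothing
    log_prod = 0.0
    for xi, pi in zip(x, p):
        if xi > 0:
            if pi == 0.0:
                return -math.inf  # an observed event with zero probability
            log_prod += xi * math.log(pi)
    return log_coef + log_prod


# Example values (made up): three events, observed 2, 1 and 0 times.
x = [2, 1, 0]
p = [0.5, 0.3, 0.2]
likelihood = math.exp(multinomial_log_likelihood(x, p))
# By hand: 3!/(2! 1! 0!) * 0.5^2 * 0.3^1 * 0.2^0 = 3 * 0.075 = 0.225
```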

The multinomial naive Bayes classifier becomes a [[linear classifier]] when expressed in log-space:<ref name="rennie">{{cite conference |last1=Rennie |first1=J. |last2=Shih |first2=L. |last3=Teevan |first3=J. |last4=Karger |first4=D. |title=Tackling the poor assumptions of naive Bayes classifiers |conference=ICML |year=2003 |url=http://people.csail.mit.edu/~jrennie/papers/icml03-nb.pdf |archive-url=https://ghostarchive.org/archive/20221009/http://people.csail.mit.edu/~jrennie/papers/icml03-nb.pdf |archive-date=2022-10-09 |url-status=live}}</ref>
<math display="block">
\begin{align}
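\log p(C_k \mid \mathbf{x}) & = \log \left( p(C_k) \prod_{i=1}^n {p_{ki}}^{x_i} \right) + c(\mathbf{x}) \\
                       & = \log p(C_k) + \sum_{i=1}^n x_i \cdot \log p_{ki} + c(\mathbf{x})                 \\
                       & = b_k + \mathbf{w}_k^\top \mathbf{x} + c(\mathbf{x})
\end{align}
</math>
where <math>b_k = \log p(C_k)</math>, <math>w_{ki} = \log p_{ki}</math>, and <math>c(\mathbf{x})</math> collects the terms that do not depend on the class (the multinomial coefficient and the normalizing constant <math>p(\mathbf{x})</math>), so it does not affect which class scores highest. Working in log space is also numerically advantageous: multiplying a large number of small probabilities can underflow in floating-point arithmetic, whereas summing their logarithms stays within the representable range.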
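
The linear form can be sketched in a few lines of code. This is an illustrative example under stated assumptions, not a definitive implementation: the names <code>log_space_scores</code> and <code>predict</code> and all parameter values are hypothetical, and every event probability is assumed to be strictly positive so that its logarithm is defined. The last lines show a long document for which the direct product of probabilities underflows to zero while the log-space score remains finite.

```python
import math


def log_space_scores(x, priors, probs):
    """Return one linear score (bias + weights . x) per class.

    priors[k] is p(C_k); probs[k][i] is p_ki, assumed strictly positive.
    """
    scores = []
    for prior, p_k in zip(priors, probs):
        bias = math.log(prior)                 # log p(C_k)
        weights = [math.log(p) for p in p_k]   # log p_ki
        scores.append(bias + sum(w * xi for w, xi in zip(weights, x)))
    return scores


def predict(x, priors, probs):
    """Return the index of the class with the highest log-space score."""
    scores = log_space_scores(x, priors, probs)
    return max(range(len(scores)), key=scores.__getitem__)


# Made-up parameters: two classes, three events.
priors = [0.5, 0.5]
probs = [[0.7, 0.2, 0.1],
         [0.1, 0.2, 0.7]]

predicted = predict([3, 1, 0], priors, probs)

# A long document: the direct product underflows, the sum of logs does not.
long_doc = [2000, 1000, 0]
direct = priors[0] * math.prod(p ** c for p, c in zip(probs[0], long_doc))
log_score = log_space_scores(long_doc, priors, probs)[0]
```

Because the class-independent terms are the same for every class, dropping them, as this sketch does, leaves the predicted class unchanged.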

If a given class and feature value never occur together in the training data, then the frequency-based probability estimate will be zero, because the estimate is directly proportional to the number of occurrences of a feature's value. This is problematic because a single zero factor wipes out all information in the other probabilities when they are multiplied. Therefore, it is often desirable to incorporate a small-sample correction, called a [[pseudocount]], in all probability estimates such that no probability is ever set to be exactly zero. This way of [[regularization (mathematics)|regularizing]] naive Bayes is called [[Laplace smoothing]] when the pseudocount is one, and [[Lidstone smoothing]] in the general case.<!-- TODO: cite Jurafsky and Martin for this -->
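
A minimal sketch of the smoothed estimate follows. It assumes the common form in which a pseudocount <code>alpha</code> is added to every event count, so that each estimate is (count + alpha) / (class total + alpha × number of events). The function name <code>smoothed_event_probs</code> and the counts are hypothetical.

```python
def smoothed_event_probs(counts, alpha=1.0):
    """Estimate p_ki with additive (Lidstone) smoothing.

    counts[k][i] is how often event i was observed in training data of class k.
    alpha is the pseudocount; alpha = 1 corresponds to Laplace smoothing.
    """
    probs = []
    for class_counts in counts:
        n_events = len(class_counts)
        total = sum(class_counts) + alpha * n_events
        probs.append([(c + alpha) / total for c in class_counts])
    return probs


# Made-up counts: event 1 never occurs in class 0, event 0 never in class 1.
counts = [[3, 0, 1],
          [0, 2, 2]]

unsmoothed = smoothed_event_probs(counts, alpha=0.0)  # contains exact zeros
laplace = smoothed_event_probs(counts, alpha=1.0)     # no zeros remain
```

With <code>alpha = 1</code>, the unseen event in class 0 receives probability 1/7 instead of 0, and each class's estimates still sum to one.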

Rennie ''et al.'' discuss problems with the multinomial assumption in the context of document classification and possible ways to alleviate those problems, including the use of [[tf–idf]] weights instead of raw term frequencies and document length normalization, to produce a naive Bayes classifier that is competitive with [[support vector machine]]s.<ref name="rennie"/>
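
The kind of preprocessing described above can be sketched as follows. This is an assumption-laden illustration rather than the authors' exact procedure: it applies a logarithmic term-frequency damping, an inverse-document-frequency weight, and Euclidean length normalization in that order, and the function name <code>transform_counts</code> and the example counts are hypothetical.

```python
import math


def transform_counts(docs):
    """Re-weight raw term counts before fitting a multinomial model.

    docs is a list of equal-length count vectors, one per document.
    Steps (assumed here): log(1 + count), times log(N / document frequency),
    then division by the vector's Euclidean norm.
    """
    n_docs = len(docs)
    n_terms = len(docs[0])
    # Number of documents in which each term occurs at least once.
    doc_freq = [sum(1 for d in docs if d[i] > 0) for i in range(n_terms)]

    transformed = []
    for d in docs:
        weights = []
        for i, count in enumerate(d):
            tf = math.log(1 + count)
            idf = math.log(n_docs / doc_freq[i]) if doc_freq[i] > 0 else 0.0
            weights.append(tf * idf)
        norm = math.sqrt(sum(w * w for w in weights))
        # Leave all-zero vectors unchanged to avoid dividing by zero.
        transformed.append([w / norm for w in weights] if norm > 0 else weights)
    return transformed


# Made-up corpus: the middle term occurs in every document.
docs = [[2, 1, 0],
        [0, 1, 3]]
weighted = transform_counts(docs)
```

In this toy corpus the term that appears in every document receives weight zero, since it carries no information for distinguishing the documents, and each transformed vector has unit length.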