===Document classification===
Here is a worked example of naive Bayesian classification applied to the [[document classification]] problem. Consider the problem of classifying documents by their content, for example into [[spamming|spam]] and non-spam [[e-mail]]s. Imagine that documents are drawn from a number of classes of documents which can be modeled as sets of words, where the (independent) probability that the ''i''-th word of a given document occurs in a document from class ''C'' can be written as
<math display="block">p(w_i \mid C)\,</math>

(For this treatment, things are further simplified by assuming that words are randomly distributed in the document; that is, words are not dependent on the length of the document, on their position within the document relative to other words, or on other document context.)

Then the probability that a given document ''D'' contains all of the words <math>w_i</math>, given a class ''C'', is
<math display="block">p(D\mid C) = \prod_i p(w_i \mid C)\,</math>

The question to be answered is: "What is the probability that a given document ''D'' belongs to a given class ''C''?" In other words, what is <math>p(C \mid D)\,</math>?

Now [[Conditional probability|by definition]]
<math display="block">p(D\mid C)={p(D\cap C)\over p(C)}</math>
and
<math display="block">p(C \mid D) = {p(D\cap C)\over p(D)}</math>

Bayes' theorem manipulates these into a statement of probability in terms of [[likelihood]]:
<math display="block">p(C\mid D) = \frac{p(C)\,p(D\mid C)}{p(D)}</math>

Assume for the moment that there are only two mutually exclusive classes, ''S'' and ¬''S'' (e.g. spam and not spam), such that every element (email) is in either one or the other:
<math display="block">p(D\mid S)=\prod_i p(w_i \mid S)\,</math>
and
<math display="block">p(D\mid\neg S)=\prod_i p(w_i\mid\neg S)\,</math>

Using the Bayesian result above, one can write:
<math display="block">p(S\mid D)={p(S)\over p(D)}\,\prod_i p(w_i \mid S)</math>
<math display="block">p(\neg S\mid D)={p(\neg S)\over p(D)}\,\prod_i p(w_i \mid\neg S)</math>

Dividing one by the other gives:
<math display="block">{p(S\mid D)\over p(\neg S\mid D)}={p(S)\,\prod_i p(w_i \mid S)\over p(\neg S)\,\prod_i p(w_i \mid\neg S)}</math>

This can be re-factored as:
<math display="block">{p(S\mid D)\over p(\neg S\mid D)}={p(S)\over p(\neg S)}\,\prod_i {p(w_i \mid S)\over p(w_i \mid\neg S)}</math>

Thus, the probability ratio p(''S'' | ''D'') / p(¬''S'' | ''D'') can be expressed in terms of a series of [[likelihood function|likelihood ratios]]. The actual probability p(''S'' | ''D'') can easily be computed from log (p(''S'' | ''D'') / p(¬''S'' | ''D'')) based on the observation that p(''S'' | ''D'') + p(¬''S'' | ''D'') = 1.

Taking the [[logarithm]] of all these ratios, one obtains:
<math display="block">\ln{p(S\mid D)\over p(\neg S\mid D)}=\ln{p(S)\over p(\neg S)}+\sum_i \ln{p(w_i\mid S)\over p(w_i\mid\neg S)}</math>

(This technique of "[[log-likelihood ratio]]s" is common in statistics. In the case of two mutually exclusive alternatives, such as this example, the conversion of a log-likelihood ratio to a probability takes the form of a [[sigmoid curve]]: see [[logit]] for details.)

Finally, the document can be classified as follows: it is spam if <math>p(S\mid D) > p(\neg S\mid D)</math> (i.e., <math>\ln{p(S\mid D) \over p(\neg S\mid D)} > 0</math>); otherwise it is not spam.
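The log-likelihood-ratio decision rule above can be sketched in a few lines of Python. The per-word probabilities and class priors below are invented purely for illustration (in practice they would be estimated from a labelled training corpus), and words without an estimated probability are simply skipped:

```python
import math

# Hypothetical word probabilities p(w_i | S) and p(w_i | notS),
# chosen for illustration only -- not estimated from real data.
p_word_spam = {"viagra": 0.8, "offer": 0.6, "meeting": 0.05}
p_word_ham = {"viagra": 0.01, "offer": 0.1, "meeting": 0.4}

# Assumed priors p(S) and p(notS).
p_spam, p_ham = 0.5, 0.5

def log_odds_spam(words):
    """ln(p(S|D)/p(notS|D)) = ln(p(S)/p(notS)) + sum_i ln(p(w_i|S)/p(w_i|notS))."""
    score = math.log(p_spam / p_ham)
    for w in words:
        if w in p_word_spam:  # skip words with no estimated probabilities
            score += math.log(p_word_spam[w] / p_word_ham[w])
    return score

def classify(words):
    odds = log_odds_spam(words)
    # Recover p(S|D) from the log-odds via the sigmoid (logistic) function,
    # using p(S|D) + p(notS|D) = 1.
    prob_spam = 1.0 / (1.0 + math.exp(-odds))
    label = "spam" if odds > 0 else "not spam"
    return label, prob_spam

label, prob = classify(["viagra", "offer"])   # positive log-odds -> "spam"
```

The sign test on the log-odds is exactly the decision rule p(''S'' | ''D'') > p(¬''S'' | ''D''); working in logs avoids the numerical underflow that multiplying many small word probabilities would cause.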