Friday, November 21, 2014

How To Build a Naive Bayes Classifier


What concerns us here is the difference between dependent and independent events, because calculating their intersection (the probability of both happening at the same time) depends on it. For independent events, calculating the intersection is easy:

    P(A ∩ B) = P(A) × P(B)
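As a quick illustration of that rule, here is a tiny two-dice example in Python (my own example, not from the original article):

    # Two fair dice rolls are independent events, so the probability
    # that both come up six is the product of the individual probabilities.
    p_first_six = 1 / 6
    p_second_six = 1 / 6
    p_both_six = p_first_six * p_second_six
    print(p_both_six)  # 1/36, roughly 0.0278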
  • so what’s the probability that an email is spam, given that it contains both “viagra” and “penis”?

To classify an email as spam, you have to calculate the conditional probability of it being spam based on the words it contains. And the Naive Bayes approach is exactly what I described above: we assume that the occurrence of one word is completely unrelated to the occurrence of any other, to simplify the processing and complexity involved.
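Spelling that out with Bayes' theorem and the independence assumption (my reconstruction; the formula image from the original article is not part of this excerpt):

    P(spam | viagra, penis)
        = P(viagra, penis | spam) × P(spam) / P(viagra, penis)
        ≈ P(viagra | spam) × P(penis | spam) × P(spam) / P(viagra, penis)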
You simply compute the probability of the text belonging to each of the categories you test against, and the category with the highest probability for the given text wins:

    classify(text) = argmax over categories C of P(C) × P(w1 | C) × P(w2 | C) × … × P(wn | C)

(where w1 … wn are the words of the text)
Do note that above I also eliminated the denominator from our original formula (called the evidence), because it is constant across categories and therefore does not change the ranking.
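Here is a minimal sketch of that rule in Python. The priors and word_probs dictionaries are hypothetical stand-ins for whatever counts you estimated from your training data; the original article's own code is not part of this excerpt:

    def classify(words, priors, word_probs):
        """Return the category with the highest Naive Bayes score.

        priors:     {category: P(category)}
        word_probs: {category: {word: P(word | category)}}
        """
        best_category, best_score = None, 0.0
        for category, prior in priors.items():
            score = prior
            for word in words:
                # Multiply in P(word | category); unseen words get a tiny
                # floor probability here (real code would use proper
                # smoothing, e.g. Laplace smoothing).
                score *= word_probs[category].get(word, 1e-9)
            if score > best_score:
                best_category, best_score = category, score
        return best_category

For example, classify(["viagra", "penis"], priors, word_probs) returns "spam" whenever the spam score beats every other category's score.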
Because multiplying many small probabilities quickly underflows the limits of floating-point numbers, if you're working with big documents (not the case in this example), you do have to make one important optimization to the above formula:
  • instead of storing the probability of each word, you store the (natural) logarithm of that probability
  • instead of multiplying the numbers, you add them
So if you need this optimization, then instead of the formula above, use this one:

    classify(text) = argmax over categories C of log P(C) + log P(w1 | C) + log P(w2 | C) + … + log P(wn | C)
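In Python, the log-space version only changes the arithmetic of the earlier sketch (again assuming the hypothetical priors and word_probs dictionaries):

    import math

    def classify_log(words, priors, word_probs):
        # Same ranking as classify(), but summing log-probabilities,
        # so long documents do not underflow to 0.0.
        best_category, best_score = None, float("-inf")
        for category, prior in priors.items():
            score = math.log(prior)
            for word in words:
                score += math.log(word_probs[category].get(word, 1e-9))
            if score > best_score:
                best_category, best_score = category, score
        return best_category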
When classifying emails as spam, you want to be quite sure that a message really is spam, otherwise users may get pissed off by too many false positives. Therefore it is a good idea to have thresholds: for example, only classify a message as spam when its spam score exceeds its ham score by some margin.
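One possible threshold rule, sketched in Python (the factor of 3 is an arbitrary illustration, not a value from the article):

    def is_spam(words, priors, word_probs, threshold=3.0):
        # Only flag a message as spam when the spam score beats the
        # ham score by at least `threshold` times, to cut down on
        # false positives.
        spam_score, ham_score = priors["spam"], priors["ham"]
        for word in words:
            spam_score *= word_probs["spam"].get(word, 1e-9)
            ham_score *= word_probs["ham"].get(word, 1e-9)
        return spam_score > threshold * ham_score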

Read full article from How To Build a Naive Bayes Classifier
