Sunday, November 16, 2014

Developing a Naive Bayes Text Classifier in JAVA | Datumbox


The code is written in Java and can be downloaded directly from GitHub. It is licensed under GPLv3, so feel free to use, modify and redistribute it.
The Text Classifier implements the Multinomial Naive Bayes model along with the Chi-square feature selection algorithm.
1. NaiveBayes Class
This is the main part of the Text Classifier. It implements methods such as train() and predict(), which train a classifier and use it for predictions. This class also calls the appropriate external methods to preprocess and tokenize the document before training or prediction.
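
To make the train/predict flow concrete, here is a minimal sketch of a Multinomial Naive Bayes classifier. It is not the Datumbox code: the class name MultinomialNBSketch, the regex tokenizer and the add-one (Laplace) smoothing are all assumptions made for illustration, and it skips feature selection entirely.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch; names and design are illustrative, not the Datumbox API.
class MultinomialNBSketch {
    private final Map<String, Integer> docsPerClass = new HashMap<>();
    private final Map<String, Integer> tokensPerClass = new HashMap<>();
    private final Map<String, Map<String, Integer>> tokenCounts = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();
    private int totalDocs = 0;

    // Assumed preprocessing: lowercase, then split on anything that is not a-z.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Accumulates the counts needed for priors and likelihoods.
    void train(String text, String label) {
        totalDocs++;
        docsPerClass.merge(label, 1, Integer::sum);
        Map<String, Integer> counts = tokenCounts.computeIfAbsent(label, k -> new HashMap<>());
        for (String token : tokenize(text)) {
            counts.merge(token, 1, Integer::sum);
            tokensPerClass.merge(label, 1, Integer::sum);
            vocabulary.add(token);
        }
    }

    // Returns the label with the highest log posterior, or null if untrained.
    String predict(String text) {
        List<String> tokens = tokenize(text);
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String label : docsPerClass.keySet()) {
            // log prior: share of training documents with this label
            double score = Math.log(docsPerClass.get(label) / (double) totalDocs);
            Map<String, Integer> counts = tokenCounts.get(label);
            int classTokens = tokensPerClass.getOrDefault(label, 0);
            for (String token : tokens) {
                if (!vocabulary.contains(token)) {
                    continue; // tokens never seen in training are ignored
                }
                int count = counts.getOrDefault(token, 0);
                // add-one smoothed log likelihood of the token given the label
                score += Math.log((count + 1.0) / (classTokens + vocabulary.size()));
            }
            if (score > bestScore) {
                bestScore = score;
                best = label;
            }
        }
        return best;
    }
}
```

Working in log space avoids numeric underflow when many small probabilities are multiplied, which is why the sketch sums logarithms instead of multiplying raw probabilities.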

2. NaiveBayesKnowledgeBase Object
The output of training is a NaiveBayesKnowledgeBase Object which stores all the necessary information and probabilities that are used by the Naive Bayes Classifier.
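
As a rough illustration of what such a knowledge base might hold, the sketch below stores log priors and per-feature log likelihoods and uses them to score a document. The class name KnowledgeBaseSketch, its field names and the score() helper are hypothetical and may not match the fields of the real NaiveBayesKnowledgeBase.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a trained model container; not the Datumbox class.
class KnowledgeBaseSketch {
    // number of training observations
    int observations;
    // category -> log P(category)
    final Map<String, Double> logPriors = new HashMap<>();
    // feature -> category -> log P(feature | category)
    final Map<String, Map<String, Double>> logLikelihoods = new HashMap<>();

    // Sums the log prior and the log likelihoods of the known tokens.
    // Tokens that are not stored in the knowledge base are ignored.
    double score(String category, Iterable<String> tokens) {
        double total = logPriors.get(category);
        for (String token : tokens) {
            Map<String, Double> perCategory = logLikelihoods.get(token);
            if (perCategory == null) {
                continue;
            }
            Double logLikelihood = perCategory.get(category);
            if (logLikelihood != null) {
                total += logLikelihood;
            }
        }
        return total;
    }
}
```

Keeping the model in a plain data holder like this separates training from prediction: once the maps are filled, prediction only needs lookups and additions.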

3. Document Object
Both the training and the prediction texts in the implementation are internally stored as Document Objects. The Document Object stores all the tokens (words) of the document, their statistics and the target classification of the document.
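
A document representation along these lines could look like the following sketch. The class name DocumentSketch, the fromText() factory and the letter-based tokenizer are assumptions for illustration; the original Document Object may be organized differently.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch of a tokenized document; not the Datumbox class.
class DocumentSketch {
    // token -> number of occurrences in the document
    final Map<String, Integer> tokens = new HashMap<>();
    // target class; left null for documents that are only being predicted
    String category;

    // Assumed preprocessing: lowercase, then split on non-letter characters.
    static DocumentSketch fromText(String text, String category) {
        DocumentSketch doc = new DocumentSketch();
        doc.category = category;
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (!token.isEmpty()) {
                doc.tokens.merge(token, 1, Integer::sum);
            }
        }
        return doc;
    }
}
```

Storing token counts rather than the raw token list keeps the per-document statistics ready for both training and prediction.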

4. FeatureStats Object
The FeatureStats Object stores several statistics that are generated during the Feature Extraction phase: the joint counts of features and classes (from which the joint probabilities and likelihoods are estimated), the class counts (from which the priors are estimated if none are given as input) and the total number of observations used for training.
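
The sketch below shows one way these three statistics could be collected and how a prior follows from the class counts. The class name FeatureStatsSketch, its field names and the addDocument() and prior() helpers are hypothetical, not taken from the original code.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the statistics gathered during feature extraction.
class FeatureStatsSketch {
    // total number of training observations
    int n;
    // feature -> category -> number of documents containing the feature
    final Map<String, Map<String, Integer>> featureCategoryJointCount = new HashMap<>();
    // category -> number of documents in that category
    final Map<String, Integer> categoryCounts = new HashMap<>();

    // Registers one training document, given its distinct features and class.
    void addDocument(Set<String> features, String category) {
        n++;
        categoryCounts.merge(category, 1, Integer::sum);
        for (String feature : features) {
            featureCategoryJointCount
                .computeIfAbsent(feature, k -> new HashMap<>())
                .merge(category, 1, Integer::sum);
        }
    }

    // Estimated prior P(category) = documents in category / all documents.
    double prior(String category) {
        return categoryCounts.getOrDefault(category, 0) / (double) n;
    }
}
```

With two "spam" documents and one "ham" document registered, prior("spam") evaluates to 2/3.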

5. FeatureExtraction Class
This class performs feature extraction. Because it internally calculates several of the statistics that the classification algorithm needs at a later stage, these statistics are cached and returned in a FeatureStats Object to avoid recalculating them.
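
Since the classifier pairs feature extraction with Chi-square selection, a minimal sketch of the Chi-square score for one feature and one class may help. It uses the standard 2x2 contingency-table formula; the class name ChiSquareSketch and the parameter names are assumptions, and the original code may compute the statistic differently.

```java
// Hypothetical sketch of a Chi-square score for one feature/class pair.
class ChiSquareSketch {
    // n11: documents that contain the feature and belong to the class
    // n10: documents that contain the feature and do not belong to the class
    // n01: documents that lack the feature and belong to the class
    // n00: documents that lack the feature and do not belong to the class
    static double score(long n11, long n10, long n01, long n00) {
        double n = n11 + n10 + n01 + n00;
        double diff = (double) n11 * n00 - (double) n10 * n01;
        double denominator = (double) (n11 + n01) * (n11 + n10) * (n10 + n00) * (n01 + n00);
        if (denominator == 0) {
            return 0.0; // degenerate table: feature or class never varies
        }
        return n * diff * diff / denominator;
    }
}
```

A feature would typically be kept when its score exceeds a chosen critical value. For example, 10.83 corresponds to a significance level of 0.001 with one degree of freedom; whether that threshold suits a given dataset is a judgment call.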

4. Additional Feature Selection Methods:

This implementation uses the Chi-square feature selection algorithm to select the most appropriate features for the classification. As we saw in a previous article, the Chi-square feature selection method is a good technique which relies on statistics to select the appropriate features; nevertheless, it tends to give higher scores to rare features that appear in only one of the categories. Improvements can be made by removing noisy/rare features before proceeding to feature selection, or by implementing additional methods such as the Mutual Information that we discussed in the aforementioned article.
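
Both suggested improvements can be sketched in a few lines. The code below is not part of the original implementation: the class name FeatureSelectionExtras, the dropRare() filter with its minCount threshold, and the mutualInformation() helper are hypothetical. The mutual information is computed in bits from the same 2x2 counts used for Chi-square.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of two possible extensions to feature selection.
class FeatureSelectionExtras {
    // Keeps only features that occur in at least minCount documents.
    static Map<String, Integer> dropRare(Map<String, Integer> documentFrequency, int minCount) {
        Map<String, Integer> kept = new HashMap<>();
        for (Map.Entry<String, Integer> entry : documentFrequency.entrySet()) {
            if (entry.getValue() >= minCount) {
                kept.put(entry.getKey(), entry.getValue());
            }
        }
        return kept;
    }

    // Mutual information between feature presence and class membership.
    // n11: feature present, in class;   n10: feature present, not in class
    // n01: feature absent, in class;    n00: feature absent, not in class
    static double mutualInformation(long n11, long n10, long n01, long n00) {
        double n = n11 + n10 + n01 + n00;
        long present = n11 + n10;
        long absent = n01 + n00;
        long inClass = n11 + n01;
        long notInClass = n10 + n00;
        return term(n11, present, inClass, n)
             + term(n10, present, notInClass, n)
             + term(n01, absent, inClass, n)
             + term(n00, absent, notInClass, n);
    }

    // One summand: P(x, y) * log2(P(x, y) / (P(x) * P(y))); zero cells add 0.
    private static double term(long joint, long rowTotal, long columnTotal, double n) {
        if (joint == 0) {
            return 0.0;
        }
        double ratio = n * joint / ((double) rowTotal * columnTotal);
        return (joint / n) * (Math.log(ratio) / Math.log(2));
    }
}
```

Filtering rare features first targets the weakness described above directly, because a feature seen in only one or two documents can no longer receive an inflated score.
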
Read full article from Developing a Naive Bayes Text Classifier in JAVA | Datumbox
