
Using Feature Selection Methods in Text Classification | Datumbox
In text classification, feature selection is the process of selecting a specific subset of the terms of the training set and using only that subset as features in the classification algorithm. The feature selection process takes place before the training of the classifier.

The main advantages of using feature selection algorithms are that they reduce the dimensionality of our data, make training faster, and can improve accuracy by removing noisy features. As a consequence, feature selection can help us avoid overfitting.

The basic selection algorithm for selecting the k best features is presented below (Manning et al., 2008): for every term of the vocabulary we compute a utility measure A(t, c) and keep the k terms with the highest values.
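As a minimal sketch of this generic loop (the function and argument names are ours, not from the original article), assuming the utility measure A(t, c) is passed in as a function:

```python
def select_features(vocabulary, klass, documents, utility, k):
    """Return the k terms with the highest utility A(t, c) for class klass.

    `utility` is any scoring function with signature (term, klass, documents);
    the mutual information and chi-square measures below are two common choices.
    """
    scored = [(utility(term, klass, documents), term) for term in vocabulary]
    scored.sort(reverse=True)                    # highest-scoring terms first
    return [term for score, term in scored[:k]]
```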
In the next sections we present two different feature selection algorithms: Mutual Information and Chi Square.

Mutual Information

One of the most common feature selection methods is the Mutual Information of term t in class c (Manning et al., 2008). It measures how much information the presence or absence of a particular term contributes to making the correct classification decision on c. The mutual information can be calculated by using the following formula:
I(U;C) = \sum_{e_t \in \{1,0\}} \sum_{e_c \in \{1,0\}} P(U = e_t, C = e_c) \log_2 \frac{P(U = e_t, C = e_c)}{P(U = e_t) \, P(C = e_c)}    [1]
In our calculations, since we use the Maximum Likelihood Estimates of the probabilities, we can use the following equation:
I(U;C) = \frac{N_{11}}{N} \log_2 \frac{N N_{11}}{N_{1.} N_{.1}} + \frac{N_{01}}{N} \log_2 \frac{N N_{01}}{N_{0.} N_{.1}} + \frac{N_{10}}{N} \log_2 \frac{N N_{10}}{N_{1.} N_{.0}} + \frac{N_{00}}{N} \log_2 \frac{N N_{00}}{N_{0.} N_{.0}}    [2]
where N is the total number of documents, and the N_{e_t e_c} are the counts of documents with the values e_t (occurrence of term t in the document; it takes the value 1 or 0) and e_c (occurrence of the document in class c; it takes the value 1 or 0) indicated by the two subscripts. For example, N_{10} is the number of documents that contain the term t (e_t = 1) but do not belong to class c (e_c = 0). A dot in place of a subscript denotes a marginal count: N_{1.} = N_{10} + N_{11} is the number of documents that contain the term, and N_{.1} = N_{01} + N_{11} is the number of documents in the class. Finally, we must note that all the aforementioned variables take non-negative values.
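Equation [2] translates directly into code. The following is an illustrative sketch (the function and its argument names are ours), computing the score from the four document counts:

```python
import math

def mutual_information(n11, n10, n01, n00):
    """Mutual information I(U;C) of a (term, class) pair, equation [2].

    n11: documents that contain the term and belong to the class
    n10: documents that contain the term but are outside the class
    n01: documents without the term that belong to the class
    n00: documents without the term outside the class
    """
    n = n11 + n10 + n01 + n00            # total number of documents
    n1_, n0_ = n11 + n10, n01 + n00      # marginal counts over e_t = 1, 0
    n_1, n_0 = n11 + n01, n10 + n00      # marginal counts over e_c = 1, 0

    def summand(n_joint, n_row, n_col):
        # By the convention 0 * log(0) = 0, empty cells contribute nothing.
        if n_joint == 0:
            return 0.0
        return (n_joint / n) * math.log2(n * n_joint / (n_row * n_col))

    return (summand(n11, n1_, n_1) + summand(n01, n0_, n_1) +
            summand(n10, n1_, n_0) + summand(n00, n0_, n_0))
```

For example, mutual_information(49, 27652, 141, 774106), the counts of the worked example for the term "export" and the class "poultry" in Manning et al. (2008), gives a score of about 0.0001105.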

Chi Square

Another common feature selection method is the Chi Square. The χ² test is used in statistics, among other things, to test the independence of two events. More specifically, in feature selection we use it to test whether the occurrence of a specific term and the occurrence of a specific class are independent. Thus we estimate the following quantity for each term and rank the terms by their score:
\chi^2(\mathbb{D}, t, c) = \sum_{e_t \in \{1,0\}} \sum_{e_c \in \{1,0\}} \frac{(N_{e_t e_c} - E_{e_t e_c})^2}{E_{e_t e_c}}    [3]

where N_{e_t e_c} is the observed frequency of the (e_t, e_c) combination in the collection and E_{e_t e_c} is its expected frequency under the assumption that term and class are independent.
High χ² scores indicate that the null hypothesis (H0) of independence should be rejected, and thus that the occurrence of the term and the occurrence of the class are dependent. If they are dependent, then we select the feature for the text classification.
The above formula can be rewritten as follows:
\chi^2(\mathbb{D}, t, c) = \frac{(N_{11} + N_{10} + N_{01} + N_{00}) \times (N_{11} N_{00} - N_{10} N_{01})^2}{(N_{11} + N_{01}) \times (N_{11} + N_{10}) \times (N_{10} + N_{00}) \times (N_{01} + N_{00})}    [4]
If we use the Chi Square method, we should select only a predefined number of features whose χ² test score is larger than 10.83, which indicates statistical significance at the 0.001 level.
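As an illustrative sketch (again, the names are ours), equation [4] and the 10.83 cut-off can be written as:

```python
def chi_square(n11, n10, n01, n00):
    """Chi-square score of a (term, class) pair via the closed form [4]."""
    numerator = (n11 + n10 + n01 + n00) * (n11 * n00 - n10 * n01) ** 2
    denominator = (n11 + n01) * (n11 + n10) * (n10 + n00) * (n01 + n00)
    return numerator / denominator if denominator else 0.0

# Critical value of the chi-square distribution with one degree of
# freedom at the 0.001 significance level.
CRITICAL_VALUE = 10.83

def is_significant(n11, n10, n01, n00):
    """True when term/class independence is rejected at the 0.001 level."""
    return chi_square(n11, n10, n01, n00) > CRITICAL_VALUE
```

On the counts of the "export"/"poultry" example above, chi_square(49, 27652, 141, 774106) returns roughly 284, far above the 10.83 cut-off, so the term would be selected.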
