- learn ``electronic news articles I find interesting'' or ``pages
on WWW that discuss machine learning topics''
- two design issues: attribute representation and probability estimates
- Define an attribute for each word position in the document and
define the value to be the word found in that position
- Notice that short documents will have fewer attributes than
longer ones
- Sample document ``Our approach to representing....us any trouble.''
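As a minimal sketch of the position-based representation (assuming a simple lowercase/whitespace tokenizer; real systems may also strip punctuation):

```python
# Sketch: represent a document as one attribute per word position.
# Attribute a_i takes as its value the word found at position i.

def to_position_attributes(document: str) -> dict:
    """Map attribute a_i (the i-th word position) to the word found there."""
    words = document.lower().split()
    return {f"a_{i+1}": word for i, word in enumerate(words)}

doc = "Our approach to representing arbitrary text documents"
attrs = to_position_attributes(doc)
print(attrs["a_1"])  # -> "our"
print(len(attrs))    # 7 attributes: a shorter document yields fewer attributes
```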
- The independence assumption states that the word probabilities
for one text position are independent of words that occur in other
positions. This is clearly incorrect, but in practice naive Bayes
performs remarkably well in many text classification problems.
Requires estimates of P(v_j) and P(a_i = w_k | v_j), where w_k
is the kth word in the vocabulary.
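The resulting decision rule picks the target value maximizing P(v_j) times the product of the per-position word probabilities. A sketch, with hypothetical probability tables and log probabilities used only to avoid numeric underflow on long documents:

```python
import math

def classify(words, priors, cond_prob):
    """Return argmax_v of P(v) * prod_i P(a_i = words[i] | v)."""
    def score(v):
        # Sum of logs = log of the product; 1e-10 is a floor for unseen words.
        return math.log(priors[v]) + sum(
            math.log(cond_prob[v].get(w, 1e-10)) for w in words)
    return max(priors, key=score)

# Hypothetical estimates for two target values.
priors = {"interesting": 0.5, "boring": 0.5}
cond_prob = {"interesting": {"learning": 0.01, "the": 0.05},
             "boring": {"learning": 0.0001, "the": 0.05}}
print(classify(["the", "learning"], priors, cond_prob))  # -> interesting
```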
- The first is easy to estimate, but the second is computationally
expensive: for 111 text positions, 2 possible target values, and 50,000
vocabulary words, the number of required probability estimates is
2 * 111 * 50,000, or about 10 million.
- Additional assumption: the probability of encountering a specific
word is independent of its position, i.e. P(a_i = w_k | v_j) =
P(w_k | v_j) for all positions i. The complexity drops to 2 * 50,000
and, even more importantly, far fewer training examples are needed!
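The arithmetic behind both counts, using the figures from the notes:

```python
# Number of distinct probability estimates, before and after assuming
# position independence.
positions, targets, vocab = 111, 2, 50_000

per_position = targets * positions * vocab   # distinct P(a_i = w_k | v_j)
print(per_position)                          # 11,100,000 -- about 10 million

shared = targets * vocab                     # one P(w_k | v_j) per word
print(shared)                                # 100,000
```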
- The m-estimate is used for estimating the probabilities:
P(w_k | v_j) = (n_k + 1) / (n + |Vocabulary|), where n is the total
number of word positions in the training examples with target value
v_j and n_k is the number of times word w_k is found among those
positions.
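With uniform priors p = 1/|Vocabulary| and m = |Vocabulary|, the m-estimate reduces to this Laplace-style form; a sketch with hypothetical counts:

```python
def m_estimate(n_k: int, n: int, vocab_size: int) -> float:
    """P(w_k | v_j) = (n_k + 1) / (n + |Vocabulary|)."""
    return (n_k + 1) / (n + vocab_size)

# Hypothetical counts: a word seen 3 times among 1000 positions,
# with a 50,000-word vocabulary.
print(m_estimate(3, 1000, 50_000))       # (3 + 1) / (1000 + 50000)

# A word never seen in training still gets a nonzero estimate,
# so a single unseen word cannot zero out the whole product.
print(m_estimate(0, 1000, 50_000) > 0)   # True
```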
Patricia Riddle
Fri May 15 13:00:36 NZST 1998