 
  
  
  
  
 
-  Goal: learn target concepts such as ``electronic news articles I find
interesting'' or ``pages on the WWW that discuss machine learning topics''
-  two design issues: attribute representation and probability estimates
-  Define an attribute for each word position in the document and
define the value to be the word found in that position
-  Notice that short documents will have fewer attributes than
longer ones
-  Sample document ``Our approach to representing....us any trouble.''
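-  $v_{NB} = \arg\max_{v_j \in V} P(v_j) \prod_{i=1}^{111} P(a_i \mid v_j)
= \arg\max_{v_j \in V} P(v_j) \, P(a_1 = \text{``our''} \mid v_j) \cdots P(a_{111} = \text{``trouble''} \mid v_j)$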
-  The independence assumption states that the word probabilities
for one text position are independent of words that occur in other
positions.  This is clearly incorrect, but in practice naive Bayes
performs remarkably well in many text classification problems.
-  Requires estimates of $P(v_j)$ and $P(a_i = w_k \mid v_j)$, where $w_k$ is the $k$th word in the vocabulary.
-  The first is easy to estimate, but the second requires far too many
parameters. For 111 text positions, 2 possible targets, and a 50,000-word
vocabulary, the number of probability estimates is 2 * 111 * 50,000, or
about 11 million
-  Additional assumption: the probability of encountering a specific
word is independent of its position in the document, so the number of
estimates drops to 2 * 50,000 = 100,000 and, even more importantly, far
fewer training examples are needed
-  The $m$-estimate is used for estimating probabilities, so
$P(w_k \mid v_j) = \frac{n_k + 1}{n + |Vocabulary|}$, where $n_k$ is the number of
times word $w_k$ is found in the training documents with target value $v_j$,
and $n$ is the total number of word positions in those documents.
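
The points above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the course's reference implementation: the function names, the toy training set, the lowercase whitespace tokenizer, the use of log probabilities to avoid underflow, and the choice to skip words outside the vocabulary are all assumptions added here.

```python
import math
from collections import Counter


def learn_naive_bayes_text(examples):
    """Estimate P(v_j) and P(w_k | v_j) from (text, label) pairs.

    Assumption: documents are tokenized by lowercasing and splitting on whitespace.
    """
    docs = [(text.lower().split(), label) for text, label in examples]
    vocabulary = {w for words, _ in docs for w in words}
    priors, word_probs = {}, {}
    for label in {lab for _, lab in docs}:
        class_docs = [words for words, lab in docs if lab == label]
        # P(v_j): fraction of training documents with this target value
        priors[label] = len(class_docs) / len(docs)
        # Position-independence: pool all word positions of the class together
        counts = Counter(w for words in class_docs for w in words)
        n = sum(counts.values())  # total word positions for this class
        # m-estimate with uniform priors: (n_k + 1) / (n + |Vocabulary|)
        word_probs[label] = {
            w: (counts[w] + 1) / (n + len(vocabulary)) for w in vocabulary
        }
    return vocabulary, priors, word_probs


def classify_naive_bayes_text(text, vocabulary, priors, word_probs):
    """Return the target value maximizing P(v_j) * prod_i P(a_i | v_j)."""
    # Assumption: words never seen in training are ignored
    words = [w for w in text.lower().split() if w in vocabulary]

    def log_score(label):
        # Sum of logs instead of a product of many small probabilities
        return math.log(priors[label]) + sum(
            math.log(word_probs[label][w]) for w in words
        )

    return max(priors, key=log_score)


def count_estimates(num_targets, vocab_size, num_positions=1):
    """Number of P(a_i = w_k | v_j) terms that must be estimated."""
    return num_targets * num_positions * vocab_size


# Hypothetical toy training set, invented for illustration only
toy_examples = [
    ("machine learning with bayes", "like"),
    ("bayes classifier for text learning", "like"),
    ("football scores tonight", "dislike"),
    ("celebrity gossip tonight", "dislike"),
]
vocabulary, priors, word_probs = learn_naive_bayes_text(toy_examples)
prediction = classify_naive_bayes_text(
    "learning about bayes", vocabulary, priors, word_probs
)
```

Calling count_estimates(2, 50000, 111) gives the position-dependent count of 11,100,000, while count_estimates(2, 50000) gives the 100,000 estimates needed once word position is ignored.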
 
Patricia Riddle 
Fri May 15 13:00:36 NZST 1998