Next: Bayes Optimal Classifier
Up: Bayesian Learning
Previous: Gradient Search to Maximize
- let us look at in the light of basic concepts of information theory
-
- This can be interpreted as a statement that short hypotheses are preferred.
- Consider the problem of designing a code to transmit messages
drawn at random, where the probability of encountering message
is . We want the code that minimizes the expected number of
bits we must transmit in order to encode a message drawn at random.
To minimize the expected code length we should assign shorter codes to
more probable messages.
- Shannon and Weaver (1949) showed the optimal code assigns
bits to encode message .
- is the description length of message with
respect to code .
- - the size of the description of the
hypothesis using the optimal representation for encoding the
hypothesis space .
- - the size of the description
of the training data given the hypothesis using the
optimal representation for encoding the data assuming that both
the sender and receiver know the hypothesis .
- To apply this principle we must choose specific representations
and
appropriate for the given learning task.
- Minimum Description Length principle:
- If and are chosen to be optimal encodings for
their respective tasks, then
- Example: apply MDL principle to the problem of learning decision
trees. is an encoding trees where the description length
grows with the number of nodes in the tree and the number of edges.
transmits misclassified examples by identifying which example
is misclassified ( bits, where is the number of
training instances) and its correct classification ( bits,
where is the number of classes)
- MDL principle provides a way of trading off hypothesis
complexity for the number of errors committed by the hypothesis
- So the MDL principle produces a MAP hypothesis if the encodings
and are optimal. But to show that we would need all
the prior probabilities as well as .
- No reason to believe the MDL hypothesis relative to arbitrary
encodings should be preferred!!!!
Next: Bayes Optimal Classifier
Up: Bayesian Learning
Previous: Gradient Search to Maximize
Patricia Riddle
Fri May 15 13:00:36 NZST 1998