Minimum Description Length

Next: Bayes Optimal Classifier Up: Bayesian Learning Previous: Gradient Search to Maximize

Minimum Description Length

let us look at in the light of basic concepts of information theory
This can be interpreted as a statement that short hypotheses are preferred.
Consider the problem of designing a code to transmit messages drawn at random, where the probability of encountering message is . We want the code that minimizes the expected number of bits we must transmit in order to encode a message drawn at random. To minimize the expected code length we should assign shorter codes to more probable messages.
Shannon and Weaver (1949) showed the optimal code assigns bits to encode message .
is the description length of message with respect to code .
- the size of the description of the hypothesis using the optimal representation for encoding the hypothesis space .
- the size of the description of the training data given the hypothesis using the optimal representation for encoding the data assuming that both the sender and receiver know the hypothesis .
To apply this principle we must choose specific representations and appropriate for the given learning task.
Minimum Description Length principle:
If and are chosen to be optimal encodings for their respective tasks, then
Example: apply MDL principle to the problem of learning decision trees. is an encoding trees where the description length grows with the number of nodes in the tree and the number of edges. transmits misclassified examples by identifying which example is misclassified ( bits, where is the number of training instances) and its correct classification ( bits, where is the number of classes)
MDL principle provides a way of trading off hypothesis complexity for the number of errors committed by the hypothesis
So the MDL principle produces a MAP hypothesis if the encodings and are optimal. But to show that we would need all the prior probabilities as well as .
No reason to believe the MDL hypothesis relative to arbitrary encodings should be preferred!!!!

Next: Bayes Optimal Classifier Up: Bayesian Learning Previous: Gradient Search to Maximize

Patricia Riddle
Fri May 15 13:00:36 NZST 1998