Wednesday, September 7, 2011

Decision Trees without Pruning

Do you want to improve the generalization of a decision tree? Skip pruning. Try full Bayesian weighting instead. A prediction can still be done in O(tree_depth) steps.

Pruning is an approximation of Bayesian weighting: it gives weight 1.0 to one short tree and weight 0 to all other trees.

We can do better. We can assign weights to all tree depths, with predictions from shorter trees getting bigger weights. Context Tree Weighting allows that.


Context Tree Weighting is normally used to predict the next bit of a string. For that, it uses a suffix of the string as a context. Different predictions are made from different context lengths, and the final prediction is a weighted sum of those predictions, with shorter context lengths getting bigger weights.
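As an illustrative sketch of that weighted sum (the `Node` class and the fixed 1/2-1/2 mixing are simplifications I am assuming here; full CTW mixes block probabilities with posterior weights), the next-bit prediction can mix a Krichevsky-Trofimov estimate at each context length, reading the context from right to left:

```python
# Sketch of depth-weighted next-bit prediction in the spirit of CTW.
# Simplified: mixes conditional KT estimates with fixed 1/2 weights.

class Node:
    def __init__(self):
        self.counts = [0, 0]          # how many 0s and 1s were seen here
        self.children = [None, None]

def kt_probability(counts, bit):
    # Krichevsky-Trofimov estimator: add-1/2 smoothing of the counts.
    return (counts[bit] + 0.5) / (counts[0] + counts[1] + 1.0)

def predict_bit(node, context, bit):
    # Mix this depth's estimate with the deeper context's prediction.
    p_here = kt_probability(node.counts, bit)
    if not context:
        return p_here
    child = node.children[context[-1]]  # context is read right to left
    if child is None:
        return p_here
    return 0.5 * p_here + 0.5 * predict_bit(child, context[:-1], bit)
```

The shallower the node, the larger its total share of the mixture, which is exactly the "shorter contexts get bigger weights" behavior described above.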

The context does not need to be a suffix. We can use any list of bits as a context:

context = get_context(model, training_example)


First, a fully grown binary decision tree is constructed. Information gain or some other greedy heuristic can be used for that.

The decision tree can now be used to select contexts for the Context Tree Weighting model. Different decision tree depths would serve as different context lengths. We would use the path from the tree root to a tree leaf as the context.

def get_context(model, training_example):
    # Walk from the root to a leaf, recording which child is taken.
    path = []
    node = model.tree.root
    while node is not None:
        child_index = node.choose_child(training_example)
        path.append(child_index)
        node = node.children[child_index]

    path.reverse()
    return path

The path is reversed to put the most important bit on the right. In a suffix, the most important bit is also on the right.

Each training example is then used to train the Context Tree Weighting model. Training precomputes values in the model, and the precomputed values can be stored in the decision tree nodes. Each example is processed in O(tree_depth) steps.
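One way to picture that O(tree_depth) update (a sketch under my own assumptions: the `Node` layout is hypothetical, and the path is materialized here as its own small context tree, whereas in the post's setting the same counts would live directly in the decision tree nodes):

```python
# Sketch: O(len(context)) training update for one example.
# Each node stores the label-bit counts a KT estimator needs.

class Node:
    def __init__(self):
        self.counts = [0, 0]          # label-bit counts seen at this node
        self.children = [None, None]

def update_counts(root, context, label_bit):
    # Bump the counts at every context length on the example's path.
    node = root
    node.counts[label_bit] += 1
    for bit in reversed(context):     # context is read right to left
        if node.children[bit] is None:
            node.children[bit] = Node()
        node = node.children[bit]
        node.counts[label_bit] += 1
```

Only the nodes on one root-to-leaf path are touched, so the cost per example is proportional to the tree depth.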


At test time, the decision tree is used as a normal Context Tree Weighting model: a context is obtained for the test example, and a prediction is made based on that context.


The possibility of non-suffix contexts was mentioned in chapter 6 of Approximate Universal Artificial Intelligence.

The weighting of different decision tree depths is simple, so I suspect somebody suggested it before. I haven't found a paper mentioning it, so I share the trick here.

1 comment:

Sean Burton said...

I realize that this post is very old and dead, but I just wanted to share the observation that it is possible to compute the class probabilities for each leaf once we have trained the tree.
Then inference can be done in roughly the same time as for a classical decision tree.

However, I'm not sure how to make CTW scale well in the number of classes. Perhaps at each node we should simply predict the most likely class (or the most likely few) vs the rest rather than assigning probabilities to each class.