Do you want to improve generalization of a decision tree? Do no pruning. Try full Bayesian weighting instead. A prediction can be still done in O(tree_depth) steps.
Pruning is an approximation of Bayesian weighting. It gives weight 1.0 to one short tree. And it gives weights 0 to all other trees.
It can be done better. We can give weights to all tree depths. Predictions from shorter trees would get bigger weights. Context Tree Weighting allows that.
Context
Context Tree Weighting is normally used to predict the next bit of a string. For that, it uses the suffix of the string as a context. Different predictions are done from different context lengths. The final prediction is a weighted sum of the predictions. The shorter context lengths get bigger weights.
The context does not need to be a suffix. We can use any list of bits as a context.
context = get_context(model, training_example)
Training
First, a fully grown binary decision tree would be constructed. Information gain or some other greedy heuristic can be used for that.
The decision tree can be now used to select contexts for the Context Tree Weighting model. Different decision tree depths should serve as different context lengths. We would use the path from the tree root to a tree leaf as the context.
def get_context(model, training_example):
path = []
node = model.tree.root
while node is not None:
child_index = node.choose_child(training_example)
path.append(child_index)
node = node.children[child_index]
path.reverse()
return path
The path is reversed to have the most important bit on the right. In a suffix the most important bit is also on the right.
Each training example would be used to train the Context Tree Weighting model. The training precomputes values in the model. The precomputed values can be stored in the decision tree nodes. Each example is processed in O(tree_depth) steps.
Prediction
The decision tree is used as a normal Context Tree Weighting model. A context of the test example is obtained. And a prediction is made based on the context.
Resources
The possibility of non-suffix contexts was mentioned in chapter 6 of Approximate Universal Artificial Intelligence.
The weighting of different decision tree depths is simple. I guess, somebody else suggested it before. I haven't found a paper mentioning it. So I share the trick here.