It is remarkable that learning from historical data is, in theory, a solved problem: Bayes' rule tells us how to update the probability of a model after seeing data. For example, we can calculate the probability that all crows are black after N black crows have been seen:
P(all_crows_are_black|seen_N_black_crows) = P(seen_N_black_crows|all_crows_are_black) * P(all_crows_are_black) * 1/P(seen_N_black_crows) = 1 * P(all_crows_are_black) * 1/P(seen_N_black_crows)
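The likelihood above is 1 because that model predicts that every crow we see is black. A different model could predict that 90% of crows are black. If the sightings are independent, its probability after seeing N black crows would be: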
P(90%_of_crows_are_black|seen_N_black_crows) = 0.9**N * P(90%_of_crows_are_black) * 1/P(seen_N_black_crows)
The factor 1/P(seen_N_black_crows) is not known directly, but it is the same for every model. We can interpret it as a normalization constant: it ensures that the probabilities of all possible models sum to 1. Or we can ignore it when we only compare the probabilities of different models.
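The normalization can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two crow models are treated as the only possible models, the equal priors of 0.5 are made up, and the function and variable names are hypothetical.

```python
def posterior_after_black_crows(black_fraction, prior, n_black):
    """Return P(model | seen n_black black crows) for every model.

    black_fraction maps a model name to the fraction of black crows it predicts.
    prior maps a model name to P(model) before any crow was seen.
    """
    # Numerator of Bayes' rule: P(seen_N_black_crows | model) * P(model).
    unnormalized = {
        name: (fraction ** n_black) * prior[name]
        for name, fraction in black_fraction.items()
    }
    # If the model list is exhaustive, this sum is P(seen_N_black_crows).
    evidence = sum(unnormalized.values())
    return {name: value / evidence for name, value in unnormalized.items()}


# Assumed setup: only two candidate models, equally likely a priori.
black_fraction = {"all_black": 1.0, "90%_black": 0.9}
prior = {"all_black": 0.5, "90%_black": 0.5}

posterior = posterior_after_black_crows(black_fraction, prior, n_black=10)
for name, probability in posterior.items():
    print(f"P({name} | 10 black crows) = {probability:.3f}")
```

With these assumed priors, ten black crows already shift most of the probability to the all_black model, and the two posteriors sum to 1 without P(seen_N_black_crows) ever being supplied from outside.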
Many models could have non-zero probability when only a short history is available. We should use them all when making a prediction. A prediction is just the probability of unseen data given the seen data. It is calculated by averaging over all models, weighting each by its probability given the old data:
P(data|old_data) = sum(P(data|h,old_data) * P(h|old_data) for h in ALL_MODELS)
This approach is completely general. It could be used for non-independent samples, time series, everything. We would then work with models that predict such non-independent data or time series.
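As a minimal sketch of the prediction formula, we could compute the probability that the next crow is black. The assumptions are the same as in the crow example: only two models exist, the priors of 0.5 are invented for illustration, and each model treats crows as independent, so P(data|h,old_data) reduces to the fraction of black crows the model predicts. All names are hypothetical.

```python
def posterior_after_black_crows(black_fraction, prior, n_black):
    """Return P(h | old_data) for every model h, where old_data is n_black black crows."""
    unnormalized = {
        name: (fraction ** n_black) * prior[name]
        for name, fraction in black_fraction.items()
    }
    evidence = sum(unnormalized.values())
    return {name: value / evidence for name, value in unnormalized.items()}


def probability_next_crow_is_black(black_fraction, prior, n_black):
    """Return P(next crow is black | old_data) by summing over all models."""
    posterior = posterior_after_black_crows(black_fraction, prior, n_black)
    # P(data | h, old_data) is the model's black fraction, because each
    # model here assumes that crows are independent of each other.
    return sum(black_fraction[name] * posterior[name] for name in black_fraction)


# Assumed setup: two candidate models with equal prior probability.
black_fraction = {"all_black": 1.0, "90%_black": 0.9}
prior = {"all_black": 0.5, "90%_black": 0.5}

for n_black in (0, 10, 100):
    prediction = probability_next_crow_is_black(black_fraction, prior, n_black)
    print(f"after {n_black} black crows: P(next is black) = {prediction:.4f}")
```

The prediction starts at the prior-weighted average of the two models and moves toward 1 as more black crows are seen. For non-independent data, only the per-model term P(data|h,old_data) would change; the sum over models stays the same.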
Additional Resources
- AI: A Modern Approach chapters 13, 14 and 20 give an introduction to probabilities and Bayesian learning.
- On Universal Prediction and Bayesian Confirmation by Marcus Hutter. It hints at how to estimate the prior probabilities P(model|no_data_yet): simpler models are preferred.