Sunday, June 26, 2011

A Bayesian Sequence Predictor

It is possible to use Bayesian inference for sequence prediction. I will show how to predict a continuation of a binary sequence.

For example, we want to know what will be the next bit, after seeing "01101".

Considered Generators

I will not consider all possible sequence generators. All possible sequence generators are all possible computer programs. I will consider a smaller set of generators. I will consider all generators, where the next generated bit depends only on a sequence suffix.

Each suffix will assign a different probability to the next bit:

import random

class OnSuffixGenerator:
    def __init__(self, suffix):
        self.suffix = suffix
        self.next_bit_probability = unknown_function(suffix)

    def generate_next_bit(self, seq):
        assert seq.endswith(self.suffix)
        probability = self.next_bit_probability
        return "1" if probability > random.random() else "0"

Multiple OnSuffixGenerators can be used to generate a sequence. They cooperate to cover different suffixes.

We don't know the used OnSuffixGenerators. We see only the beginning of a generated sequence. And we assume that it was generated by a suffix-based generator.

Brute Force Approach

If we want to predict the next bit of the sequence, we want to know the probability of the next bit being one:

P(NextBit=1|seq) = P(seq + "1") / P(seq)

For that, we need to calculate P(seq + "1") and P(seq). The probability of a sequence is its marginal probability:

P(seq) = sum(P(seq|generator) * P(generator) for
        generator in ALL_GENERATORS)

A brute force approach would calculate the expression directly. The number of all possible suffix-based generators is very big. Each generator is a cooperation of one or more OnSuffixGenerators.

For example, OnSuffixGenerators for "00", "10" and "1" suffixes can cooperate as:

def get_next_bit_generator(seq):
    if seq.endswith("00"):
        return OnSuffixGenerator("00")
    if seq.endswith("10"):
        return OnSuffixGenerator("10")
    if seq.endswith("1"):
        return OnSuffixGenerator("1")

The number of all possible suffix-based generators is bigger than 2**len(seq). The brute force approach would need to iterate over all of them.

Context-Tree Weighting Approach

Fortunately, we can calculate the sequence probability in a faster way. We can consider one OnSuffixGenerator in many cooperations at the same time.


If we have only two possible generators, the probability of a sequence is:

P(seq) = P(seq|generator1) * P(generator1) +
         P(seq|generator2) * P(generator2)

We will give them equal prior probabilities. You can incorporate a different prior belief, if you have it.

P(seq) = P(seq|generator1) * 0.5 +
         P(seq|generator2) * 0.5


We now need to express P(seq|generator1), when considering our possible sequence generators. A considered OnSuffixGenerator is generating only bits after the given suffix. A different generator is used to generate the remaining bits. I will need a notation to express the probability of just bits after the given suffix. I will use P(seq on s) syntax for the probability that the bits after the suffix s were generated by an OnSuffixGenerator(s).

P(seq on s) = P(seq|
        bits_with_different_suffix(seq, s),
P(seq on not_present_suffix) = P(seq|seq) = 1

I will use *s syntax to denote all suffixes ending with the suffix s.

Putting it together

Let's consider the probability of a sequence. For the bits after a suffix, we have only two possible generators:

  1. The bits can be generated by a generator with the given suffix.
  2. Or the bits can be generated by a cooperation of different generators with some longer suffixes.
P(seq on *s) = P(seq on s) * 0.5 +
        P(seq on *0s) * P(seq on *1s) * P(seq on STARTs) * 0.5

The "STARTs" suffix is matching only at the start of the sequence. It is used to cover the bits uncovered by the longer suffixes.

The probability of the whole sequence is the probability of all its bits:

P(seq) = P(seq on *)

The obtained sequence predictor isn't just theoretically nice. Context-Tree Weighting (CTW) variants are doing well on text compression benchmarks.


The recursive P(seq on *s) can reuse previous computations. It can be then computed in O(len(seq)) steps. The computation starts at the biggest depth and carries the partial results to the front.

I implemented the algorithm in Python. You can play with continuation of binary sequences.


  1. The Context-Tree Weighting Method: Basic Properties: This paper introduced the CTW method.
  2. Reflections on "The Context-Tree Weighting Method: Basic Properties": It complements the original paper in explanation.
  3. A Monte-Carlo AIXI Approximation: Joel Veness's C++ implementation of CTW shows clearly, how to implement the algorithm efficiently.
  4. Approximate Universal Artificial Intelligence: Chapter 3 in Joel Veness's PhD thesis explains how the used P(generator) prior is related to the generator complexity. Interesting future ideas are proposed in chapter 6.