Friday, December 30, 2011

Results from Sum-Product Networks

I'm now experimenting with sum-product networks. I will show you some obtained results.

Predicting a half of the image

The above picture shows:

  1. The original images in the first row.
  2. The visible part of the input in the second row.
  3. The expected pixel intensities predicted by the network.

The shown images are from a validation set. The sum-product network was trained on different images. The training was done on 800 images from the notMNIST dataset.

The trained sum-product network is a probabilistic model. It can compute the probability of something, when given some evidence. Here, the network is asked to compute the expected intensity of each pixel. The given evidence is the right half of the image.

Some of the expected pixel intensities look bad. Some of them look good. If multiple values are possible for a pixel, the expected intensity will be a weighted average of the possibilities.

Predicting every second column

The expected pixel intensities look better when giving every second column as the evidence. The number of pixels to predict is still one half of the image.

The better result can be explained by:

  1. The set of all probable whole images is better pruned, when showing pixels from all parts of the image. The expected pixel intensity will be an average from fewer possibilities.

  2. The used network structure knows about locality of pixels. Nearby image regions are connected at the bottom layers of the network.

The whole network gives high probability to some patterns. And it gives lower probability to other patterns. The top sum node has multiple possible patterns as children. The children of a sum node are product nodes. The product nodes split the pattern to smaller sub-patterns.

If two nearby sub-patterns occur together, it is easy to connect the sub-patterns by a product node. The new pattern means: sub_pattern1 AND sub_pattern2. The new pattern will occur often, if the sub-patterns occur often together. Such patterns are discovered when training the network.

Monday, October 24, 2011

The Big Picture

It is helpful to view all machine-learning methods as approximations of Bayesian inference. It allows to devise new approximations or to make some approximations more precise.

The following two videos show the unified view. They explain it better than I would do.

Note that the goal is to minimize the expected loss. The expectation is over all possible examples. Modeling P(X, Y) can help to have small loss on unseen examples.

Even SVM can be viewed as a probabilistic model.

Sunday, October 2, 2011

Intro to Sum-Product Networks

In 2011, a new probabilistic model was proposed. It is fast. It can represent many probability functions. And its learning prefers simple explanations. I will show you how the model looks like. You can then read the original paper for details: Sum-Product Networks: A New Deep Architecture.

Computing Probabilities Quickly

Sum-product networks allow to compute the probability of an event quickly. They were first invented as a data structure for the quick computation.

Network with one variable

A sum-product network is a directed acyclic graph. It has alternating layers of sum and product nodes. Only the edges below a sum node have weights. The weights are normalized to sum to 1.

To compute the probability of an evidence, the network is evaluated from bottom up.

P(X1=1) = net.eval_with(x1=1, not_x1=0)
    = P(X1=1) * 1 + P(X1=0) * 0

Network with two independent variables

When two variables are independent, their join probability can be represented concisely. We don't need to store the probabilities of all their combinations. It is enough to store the factors: P(X1=1), P(X2=1).

The joint probability is calculated by evaluation the network:

P(X1=1, X2=1) = net.eval_with(
        x1=1, not_x1=0,
        x2=1, not_x2=0)
    = (P(X1=1) * 1 + P(X1=0) * 0) * (P(X2=1) * 1 + P(X2=0) * 0)
    = P(X1=1) * P(X2=1)

It is also possible to calculate the probability of a subset of variables. Naively, we would evaluate the network multiple times:

P(X1=1) = P(X1=1, X2=0) + P(X1=1, X2=1)

It can be done faster. We want the network to sum all the possible branches. We can do that by setting both x2 and not_x2 to 1. It is then enough to evaluate the network just once.

P(X1=1) = net.eval_with(
        x1=1, not_x1=0,
        x2=1, not_x2=1)
    = (P(X1=1) * 1 + P(X1=0) * 0) * (P(X2=1) * 1 + P(X2=0) * 1)
    = P(X1=1) * 1.0

The computation is done in O(num_edges) steps.

Network with conditional independence

Conditional independence also helps to make the network concise. In the above network, variables X1 and X2 are independent when given X3. The value of X3 switches between the branches of the network.

P(X1=1, X2=1, X3=1) = net.eval_with(
        x1=1, not_x1=0,
        x2=1, not_x2=0,
        x3=1, not_x3=0)
    = (P(X1=1|X3=1) * P(X2=1|X3=1) * 1) * P(X3=1)

This network can represent the same probabilities as a Naive Bayes model.

Expressive Power

Any Bayesian network can be converted to a sum-product network. The resulting sum-product network may have many edges. In the worst case, the number of edges in the sum-product network will be proportional to the time complexity of computing probabilities in the Bayesian network (i.e., O(num_vars * 2**treewidth)).

Some probability functions can be represented more concisely by a sum-product network. For example, the sum-product network can omit edges with zero weights. And a sum-product network can reuse nodes. The network does not need to be a tree. A node can be reused by multiple parents.

The concise sum-product network can be learned directly from training examples.


Learning is done by maximizing likelihood P(training_examples|net). The learned network should give high probability to each seen example.

Requirements for network structure

At the start, we don't know the network structure and the edge weights. We have some requirements for the network structure:

  • The number of edges should be small. It forces the network to find general explanations for the seen examples. That should improve generalization. The network will have also small number of weights. A small number of examples will be enough to estimate the weights.

    The small number of edges will also make the network evaluation fast.

  • The network should produce valid probabilities. The probability of an event should be equal to the sum of the probabilities of all the included outcomes.

Fortunately, the validity of the produced probabilities can be ensured by two simple constrains:

  1. All children of a sum node must use the same variables. No variable can be missing and no variable can be appended. Only a product node can split the set of used variables to subsets.

  2. A product node must not multiply xi with not_xi.

Learning steps

A valid network can be learned by:

  1. Define a valid structure of the network. The structure can have many layers and many edges. The initial weights will be zero.

  2. Increment the probability of each training example, by incrementing some weights.

  3. Only the edges with non-zero weights will be retained in the final network.

The finding of the weights to increment is also fast. It is similar to evaluating of the network. The most active edges should have their weights incremented.

The details can be seen in the SPN source code. The learning stops at a local maximum. Ask me for an explanation, if you are interested.

Wednesday, September 7, 2011

Decision Trees without Pruning

Do you want to improve generalization of a decision tree? Do no pruning. Try full Bayesian weighting instead. A prediction can be still done in O(tree_depth) steps.

Pruning is an approximation of Bayesian weighting. It gives weight 1.0 to one short tree. And it gives weights 0 to all other trees.

It can be done better. We can give weights to all tree depths. Predictions from shorter trees would get bigger weights. Context Tree Weighting allows that.


Context Tree Weighting is normally used to predict the next bit of a string. For that, it uses the suffix of the string as a context. Different predictions are done from different context lengths. The final prediction is a weighted sum of the predictions. The shorter context lengths get bigger weights.

The context does not need to be a suffix. We can use any list of bits as a context.

context = get_context(model, training_example)


First, a fully grown binary decision tree would be constructed. Information gain or some other greedy heuristic can be used for that.

The decision tree can be now used to select contexts for the Context Tree Weighting model. Different decision tree depths should serve as different context lengths. We would use the path from the tree root to a tree leaf as the context.

def get_context(model, training_example):
    path = []
    node = model.tree.root
    while node is not None:
        child_index = node.choose_child(training_example)
        node = node.children[child_index]

    return path

The path is reversed to have the most important bit on the right. In a suffix the most important bit is also on the right.

Each training example would be used to train the Context Tree Weighting model. The training precomputes values in the model. The precomputed values can be stored in the decision tree nodes. Each example is processed in O(tree_depth) steps.


The decision tree is used as a normal Context Tree Weighting model. A context of the test example is obtained. And a prediction is made based on the context.


The possibility of non-suffix contexts was mentioned in chapter 6 of Approximate Universal Artificial Intelligence.

The weighting of different decision tree depths is simple. I guess, somebody else suggested it before. I haven't found a paper mentioning it. So I share the trick here.

Sunday, August 7, 2011

To the edge of our galaxy

It is possible to travel 500+ light years in 80 years. And you don't need to travel faster than light. Stephen Hawking mentioned the possibility at the end of his article.

Here's how

Use a giant rocket. It needs to carry rocket fuel for 6 years of accelerating. After the 6 years, you would travel at 99% of the speed of light. You would be moving very fast in space. You would be moving less in time. Your clock would go slower than clocks on Earth.

Sunday, June 26, 2011

A Bayesian Sequence Predictor

It is possible to use Bayesian inference for sequence prediction. I will show how to predict a continuation of a binary sequence.

For example, we want to know what will be the next bit, after seeing "01101".

Considered Generators

I will not consider all possible sequence generators. All possible sequence generators are all possible computer programs. I will consider a smaller set of generators. I will consider all generators, where the next generated bit depends only on a sequence suffix.

Each suffix will assign a different probability to the next bit:

import random

class OnSuffixGenerator:
    def __init__(self, suffix):
        self.suffix = suffix
        self.next_bit_probability = unknown_function(suffix)

    def generate_next_bit(self, seq):
        assert seq.endswith(self.suffix)
        probability = self.next_bit_probability
        return "1" if probability > random.random() else "0"

Multiple OnSuffixGenerators can be used to generate a sequence. They cooperate to cover different suffixes.

We don't know the used OnSuffixGenerators. We see only the beginning of a generated sequence. And we assume that it was generated by a suffix-based generator.

Brute Force Approach

If we want to predict the next bit of the sequence, we want to know the probability of the next bit being one:

P(NextBit=1|seq) = P(seq + "1") / P(seq)

For that, we need to calculate P(seq + "1") and P(seq). The probability of a sequence is its marginal probability:

P(seq) = sum(P(seq|generator) * P(generator) for
        generator in ALL_GENERATORS)

A brute force approach would calculate the expression directly. The number of all possible suffix-based generators is very big. Each generator is a cooperation of one or more OnSuffixGenerators.

For example, OnSuffixGenerators for "00", "10" and "1" suffixes can cooperate as:

def get_next_bit_generator(seq):
    if seq.endswith("00"):
        return OnSuffixGenerator("00")
    if seq.endswith("10"):
        return OnSuffixGenerator("10")
    if seq.endswith("1"):
        return OnSuffixGenerator("1")

The number of all possible suffix-based generators is bigger than 2**len(seq). The brute force approach would need to iterate over all of them.

Context-Tree Weighting Approach

Fortunately, we can calculate the sequence probability in a faster way. We can consider one OnSuffixGenerator in many cooperations at the same time.


If we have only two possible generators, the probability of a sequence is:

P(seq) = P(seq|generator1) * P(generator1) +
         P(seq|generator2) * P(generator2)

We will give them equal prior probabilities. You can incorporate a different prior belief, if you have it.

P(seq) = P(seq|generator1) * 0.5 +
         P(seq|generator2) * 0.5


We now need to express P(seq|generator1), when considering our possible sequence generators. A considered OnSuffixGenerator is generating only bits after the given suffix. A different generator is used to generate the remaining bits. I will need a notation to express the probability of just bits after the given suffix. I will use P(seq on s) syntax for the probability that the bits after the suffix s were generated by an OnSuffixGenerator(s).

P(seq on s) = P(seq|
        bits_with_different_suffix(seq, s),
P(seq on not_present_suffix) = P(seq|seq) = 1

I will use *s syntax to denote all suffixes ending with the suffix s.

Putting it together

Let's consider the probability of a sequence. For the bits after a suffix, we have only two possible generators:

  1. The bits can be generated by a generator with the given suffix.
  2. Or the bits can be generated by a cooperation of different generators with some longer suffixes.
P(seq on *s) = P(seq on s) * 0.5 +
        P(seq on *0s) * P(seq on *1s) * P(seq on STARTs) * 0.5

The "STARTs" suffix is matching only at the start of the sequence. It is used to cover the bits uncovered by the longer suffixes.

The probability of the whole sequence is the probability of all its bits:

P(seq) = P(seq on *)

The obtained sequence predictor isn't just theoretically nice. Context-Tree Weighting (CTW) variants are doing well on text compression benchmarks.


The recursive P(seq on *s) can reuse previous computations. It can be then computed in O(len(seq)) steps. The computation starts at the biggest depth and carries the partial results to the front.

I implemented the algorithm in Python. You can play with continuation of binary sequences.


  1. The Context-Tree Weighting Method: Basic Properties: This paper introduced the CTW method.
  2. Reflections on "The Context-Tree Weighting Method: Basic Properties": It complements the original paper in explanation.
  3. A Monte-Carlo AIXI Approximation: Joel Veness's C++ implementation of CTW shows clearly, how to implement the algorithm efficiently.
  4. Approximate Universal Artificial Intelligence: Chapter 3 in Joel Veness's PhD thesis explains how the used P(generator) prior is related to the generator complexity. Interesting future ideas are proposed in chapter 6.

Saturday, August 14, 2010

You don't see everything

Your perception is suppressed before a movement of your eyes. You cannot see your eyes moving when looking into a mirror. Look at the left eye, look at the right eye. Only an external observer will see the movement of your eyes.

Your perception is resumed when the image stops being blurred. That is, when the velocity of the eye matches the relative velocity of the object. That also allows you to see non-blurred details outside of a running train window.