## Sunday, October 2, 2011

### Intro to Sum-Product Networks

In 2011, a new probabilistic model was proposed. It is fast. It can represent many probability functions. And its learning prefers simple explanations. I will show you how the model looks like. You can then read the original paper for details: Sum-Product Networks: A New Deep Architecture.

## Computing Probabilities Quickly

Sum-product networks allow to compute the probability of an event quickly. They were first invented as a data structure for the quick computation.

### Network with one variable

A sum-product network is a directed acyclic graph. It has alternating layers of sum and product nodes. Only the edges below a sum node have weights. The weights are normalized to sum to 1.

To compute the probability of an evidence, the network is evaluated from bottom up.

``````P(X1=1) = net.eval_with(x1=1, not_x1=0)
= P(X1=1) * 1 + P(X1=0) * 0
``````

### Network with two independent variables

When two variables are independent, their join probability can be represented concisely. We don't need to store the probabilities of all their combinations. It is enough to store the factors: P(X1=1), P(X2=1).

The joint probability is calculated by evaluation the network:

``````P(X1=1, X2=1) = net.eval_with(
x1=1, not_x1=0,
x2=1, not_x2=0)
= (P(X1=1) * 1 + P(X1=0) * 0) * (P(X2=1) * 1 + P(X2=0) * 0)
= P(X1=1) * P(X2=1)
``````

It is also possible to calculate the probability of a subset of variables. Naively, we would evaluate the network multiple times:

``````P(X1=1) = P(X1=1, X2=0) + P(X1=1, X2=1)
``````

It can be done faster. We want the network to sum all the possible branches. We can do that by setting both x2 and not_x2 to 1. It is then enough to evaluate the network just once.

``````P(X1=1) = net.eval_with(
x1=1, not_x1=0,
x2=1, not_x2=1)
= (P(X1=1) * 1 + P(X1=0) * 0) * (P(X2=1) * 1 + P(X2=0) * 1)
= P(X1=1) * 1.0
``````

The computation is done in O(num_edges) steps.

### Network with conditional independence

Conditional independence also helps to make the network concise. In the above network, variables X1 and X2 are independent when given X3. The value of X3 switches between the branches of the network.

``````P(X1=1, X2=1, X3=1) = net.eval_with(
x1=1, not_x1=0,
x2=1, not_x2=0,
x3=1, not_x3=0)
= (P(X1=1|X3=1) * P(X2=1|X3=1) * 1) * P(X3=1)
``````

This network can represent the same probabilities as a Naive Bayes model.

## Expressive Power

Any Bayesian network can be converted to a sum-product network. The resulting sum-product network may have many edges. In the worst case, the number of edges in the sum-product network will be proportional to the time complexity of computing probabilities in the Bayesian network (i.e., O(num_vars * 2**treewidth)).

Some probability functions can be represented more concisely by a sum-product network. For example, the sum-product network can omit edges with zero weights. And a sum-product network can reuse nodes. The network does not need to be a tree. A node can be reused by multiple parents.

The concise sum-product network can be learned directly from training examples.

## Learning

Learning is done by maximizing likelihood P(training_examples|net). The learned network should give high probability to each seen example.

### Requirements for network structure

At the start, we don't know the network structure and the edge weights. We have some requirements for the network structure:

• The number of edges should be small. It forces the network to find general explanations for the seen examples. That should improve generalization. The network will have also small number of weights. A small number of examples will be enough to estimate the weights.

The small number of edges will also make the network evaluation fast.

• The network should produce valid probabilities. The probability of an event should be equal to the sum of the probabilities of all the included outcomes.

Fortunately, the validity of the produced probabilities can be ensured by two simple constrains:

1. All children of a sum node must use the same variables. No variable can be missing and no variable can be appended. Only a product node can split the set of used variables to subsets.

2. A product node must not multiply xi with not_xi.

### Learning steps

A valid network can be learned by:

1. Define a valid structure of the network. The structure can have many layers and many edges. The initial weights will be zero.

2. Increment the probability of each training example, by incrementing some weights.

3. Only the edges with non-zero weights will be retained in the final network.

The finding of the weights to increment is also fast. It is similar to evaluating of the network. The most active edges should have their weights incremented.

The details can be seen in the SPN source code. The learning stops at a local maximum. Ask me for an explanation, if you are interested.

Guido said...

Looks interesting. Is there any known implementation or is it just DIY?

Ivo said...

There is a Java implementation by the original authors:
http://alchemy.cs.washington.edu/spn/

You can run it on your laptop. See the user guide. It uses MPI to distribute the computation to multiple cores or machines.

vlasdas said...

Very nice blog. Keep the good work! Sum-products networks seem to be promising.

Ziheng Jiang said...

Clear and concise. Much easier to understand than the original paper...:). Thanks, dude.

Ivo Danihelka said...

I will try to explain some unmentioned things.

Input to the network
====================
At the bottom layer, each pixel represents a cell in a grid.
Each pixels is modeled by 4 models: white, whiteGrey, blackGrey, black.
So the cell holds 4 values:
P(pixel|PixelModel=white), P(pixel|PixelModel=whiteGrey),
P(pixel|PixelModel=blackGrey), P(pixel|PixelModel=black)

For the images, these probabilities are modeled by a 4 Gaussians.
Each Gaussian has a different mean. E.g., the mean of the "white"
model corresponds to intensity of white pixels.

Product nodes:
Each product node joins two regions together. At the lowest layer,
a region is just a pixel position. The joined probability is:
P(pixel1, pixel2|ThisProductNode).

For example, one product node can be joining the often-white pixel1
and the often-black pixel2:
P(pixel1, pixel2|ThisProductNode) = P(pixel1|PixelModel=white) *
P(pixel2|PixelModel=black)

Different product nodes can be joining different combinations of pixel values.

In higher layers, bigger and bigger regions are joined together.
In the SPN Java code, they allow to join adjacent regions together.

SPN is not limited to joining just adjacent regions.
But the space of possible combinations to check needs to small.
The hard EM training is searching for product nodes with high likelihood:
P(region|ProductNode).

======================
You know EM for a mixture of Guassians, right?
The "hard EM" refers to using k-means like training instead.
Only one centroid is selected for each example.
The "centroid" is a branch in the SPN tree.
The weights on the selected branch are increased,
to increase the P(example|whole_net).

The best branch to select, is the branch which would produce
the highest P(example|whole_net), after increasing the branch weights.
So when looking for the best branch:
1) The network is evaluated from bottom up.
2) When evaluating a sum node,
all possible increments are considered. A consideration: increment the weight
for one of the children and keep the other weights unchanged.
=> Find the child with the highest P(covered_region|childNode)
3) The best found P(covered_region|ThisSumNode_with_increment) is reported up.
4) The best branch has the highest P(example|ThisBranch_with_increment)
at the top.

BTW, I recommend the ML and PGM classes at Coursera.

karine miras said...

Thank you !!!