Monday, October 24, 2011

The Big Picture

It is helpful to view all machine-learning methods as approximations of Bayesian inference. This view allows us to devise new approximations or to make existing approximations more precise.

The following two videos show the unified view. They explain it better than I would.

Note that the goal is to minimize the expected loss. The expectation is taken over all possible examples. Modeling P(X, Y) can help to achieve a small loss on unseen examples.
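In symbols (my notation, not something spelled out in the videos), the expected loss of a predictor h would be:

expected_loss(h) = sum over all (x, y) of P(X=x, Y=y) * loss(h(x), y)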

Footnote:
Even an SVM can be viewed as a probabilistic model.

Sunday, October 2, 2011

Intro to Sum-Product Networks

In 2011, a new probabilistic model was proposed. It is fast. It can represent many probability functions. And its learning prefers simple explanations. I will show you what the model looks like. You can then read the original paper for details: Sum-Product Networks: A New Deep Architecture.

Computing Probabilities Quickly

Sum-product networks make it possible to compute the probability of an event quickly. They were first invented as a data structure for this quick computation.

Network with one variable

A sum-product network is a directed acyclic graph. It has alternating layers of sum and product nodes. Only the edges going out of a sum node have weights, and the weights of each sum node are normalized to sum to 1.

To compute the probability of evidence, the network is evaluated from the bottom up.

P(X1=1) = net.eval_with(x1=1, not_x1=0)
    = P(X1=1) * 1 + P(X1=0) * 0
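The following Python sketch shows one possible implementation of such a network and its eval_with method. The names Indicator, Sum, Product, Network and the value p_x1 are mine, chosen for illustration; they are not from the paper:

class Indicator:
    """Leaf node that reads one input indicator (e.g., x1 or not_x1)."""
    def __init__(self, name):
        self.name = name

    def eval(self, inputs):
        return inputs[self.name]


class Sum:
    """Sum node: a weighted sum of its children. The weights sum to 1."""
    def __init__(self, children, weights):
        assert abs(sum(weights) - 1.0) < 1e-9
        self.children = children
        self.weights = weights

    def eval(self, inputs):
        return sum(w * child.eval(inputs)
                   for w, child in zip(self.weights, self.children))


class Product:
    """Product node: the product of its children."""
    def __init__(self, children):
        self.children = children

    def eval(self, inputs):
        result = 1.0
        for child in self.children:
            result *= child.eval(inputs)
        return result


class Network:
    """Wraps the root node and exposes eval_with(x1=..., not_x1=...)."""
    def __init__(self, root):
        self.root = root

    def eval_with(self, **inputs):
        return self.root.eval(inputs)


# The one-variable network: a sum node over the two indicators of X1.
p_x1 = 0.7  # an assumed example value of P(X1=1)
net = Network(Sum([Indicator("x1"), Indicator("not_x1")], [p_x1, 1 - p_x1]))
print(net.eval_with(x1=1, not_x1=0))  # 0.7 = P(X1=1)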

Network with two independent variables

When two variables are independent, their joint probability can be represented concisely. We don't need to store the probabilities of all their combinations. It is enough to store the factors: P(X1=1), P(X2=1).

The joint probability is calculated by evaluating the network:

P(X1=1, X2=1) = net.eval_with(
        x1=1, not_x1=0,
        x2=1, not_x2=0)
    = (P(X1=1) * 1 + P(X1=0) * 0) * (P(X2=1) * 1 + P(X2=0) * 0)
    = P(X1=1) * P(X2=1)
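Continuing the sketch above, the two-variable network is just a product node over two one-variable sum nodes (p_x2 is another assumed example value):

p_x2 = 0.4  # an assumed example value of P(X2=1)
branch_x1 = Sum([Indicator("x1"), Indicator("not_x1")], [p_x1, 1 - p_x1])
branch_x2 = Sum([Indicator("x2"), Indicator("not_x2")], [p_x2, 1 - p_x2])
net2 = Network(Product([branch_x1, branch_x2]))
print(net2.eval_with(x1=1, not_x1=0, x2=1, not_x2=0))  # 0.28 = P(X1=1) * P(X2=1)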

It is also possible to calculate the marginal probability of a subset of the variables. Naively, we would evaluate the network multiple times:

P(X1=1) = P(X1=1, X2=0) + P(X1=1, X2=1)

It can be done faster. We want the network to sum over all the possible branches. We can do that by setting both x2 and not_x2 to 1. It is then enough to evaluate the network just once.

P(X1=1) = net.eval_with(
        x1=1, not_x1=0,
        x2=1, not_x2=1)
    = (P(X1=1) * 1 + P(X1=0) * 0) * (P(X2=1) * 1 + P(X2=0) * 1)
    = P(X1=1) * 1.0
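In the sketch, marginalizing out X2 needs no new code; both of its indicators are simply set to 1:

# The X2 branch evaluates to p_x2 * 1 + (1 - p_x2) * 1 = 1.0.
print(net2.eval_with(x1=1, not_x1=0, x2=1, not_x2=1))  # 0.7 = P(X1=1)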

The computation is done in O(num_edges) steps.

Network with conditional independence

Conditional independence also helps to make the network concise. In the network above, variables X1 and X2 are independent given X3. The value of X3 switches between the branches of the network.

P(X1=1, X2=1, X3=1) = net.eval_with(
        x1=1, not_x1=0,
        x2=1, not_x2=0,
        x3=1, not_x3=0)
    = (P(X1=1|X3=1) * P(X2=1|X3=1) * 1) * P(X3=1)

This network can represent the same probabilities as a Naive Bayes model.
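The same sketch classes can express this structure. A root sum node mixes two product branches, one per value of X3; all the probabilities below are assumed example values:

p_x3 = 0.5                          # P(X3=1)
p_x1_given_x3 = {1: 0.9, 0: 0.2}    # P(X1=1 | X3=value)
p_x2_given_x3 = {1: 0.8, 0: 0.3}    # P(X2=1 | X3=value)

def branch_for(x3_value, x3_indicator):
    """The product branch selected when X3 has the given value."""
    p1 = p_x1_given_x3[x3_value]
    p2 = p_x2_given_x3[x3_value]
    return Product([
        Sum([Indicator("x1"), Indicator("not_x1")], [p1, 1 - p1]),
        Sum([Indicator("x2"), Indicator("not_x2")], [p2, 1 - p2]),
        Indicator(x3_indicator),
    ])

net3 = Network(Sum([branch_for(1, "x3"), branch_for(0, "not_x3")],
                   [p_x3, 1 - p_x3]))
print(net3.eval_with(x1=1, not_x1=0, x2=1, not_x2=0, x3=1, not_x3=0))
# 0.36 = P(X3=1) * P(X1=1|X3=1) * P(X2=1|X3=1) = 0.5 * 0.9 * 0.8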

Expressive Power

Any Bayesian network can be converted to a sum-product network. The resulting sum-product network may have many edges. In the worst case, the number of edges in the sum-product network will be proportional to the time complexity of computing probabilities in the Bayesian network (i.e., O(num_vars * 2**treewidth)).

Some probability functions can be represented more concisely by a sum-product network. For example, a sum-product network can omit edges with zero weights. It can also reuse nodes: the network does not need to be a tree, so a node can be shared by multiple parents.

A concise sum-product network can be learned directly from training examples.

Learning

Learning is done by maximizing the likelihood P(training_examples|net). The learned network should give a high probability to each seen example.
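In the sketch's terms, the objective could be written as a log-likelihood, with each example being a dict of indicator values:

import math

def log_likelihood(net, examples):
    """Sum of log P(example | net) over the training examples."""
    return sum(math.log(net.eval_with(**example)) for example in examples)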

Requirements for network structure

At the start, we know neither the network structure nor the edge weights. We have some requirements for the network structure:

  • The number of edges should be small. That forces the network to find general explanations for the seen examples, which should improve generalization. The network will also have a small number of weights, so a small number of examples will be enough to estimate them.

    The small number of edges will also make the network evaluation fast.

  • The network should produce valid probabilities. The probability of an event should be equal to the sum of the probabilities of all the outcomes it includes.

Fortunately, the validity of the produced probabilities can be ensured by two simple constraints (a checking sketch follows the list):

  1. All children of a sum node must use the same set of variables. No variable can be missing and no variable can be added. Only a product node can split the set of used variables into subsets.

  2. A product node must not multiply xi with not_xi.
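The sketch below shows how the two constraints could be checked on the classes from the earlier sketches. The scope of a node is the set of variables it uses; for constraint 2, the check uses the stronger condition of disjoint child scopes, which guarantees that xi and not_xi never meet under a product node:

def var_of(indicator_name):
    """Maps an indicator name to its variable: 'x1' and 'not_x1' -> 'x1'."""
    prefix = "not_"
    if indicator_name.startswith(prefix):
        return indicator_name[len(prefix):]
    return indicator_name

def scope(node):
    """The set of variables used below the node."""
    if isinstance(node, Indicator):
        return {var_of(node.name)}
    return set().union(*(scope(child) for child in node.children))

def is_valid(node):
    if isinstance(node, Indicator):
        return True
    if not all(is_valid(child) for child in node.children):
        return False
    child_scopes = [scope(child) for child in node.children]
    if isinstance(node, Sum):
        # Constraint 1: all children of a sum node use the same variables.
        return all(s == child_scopes[0] for s in child_scopes)
    # Constraint 2, via disjoint scopes: no variable appears under two
    # children of a product node, so xi can never meet not_xi.
    seen = set()
    for s in child_scopes:
        if seen & s:
            return False
        seen |= s
    return True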

Learning steps

A valid network can be learned by:

  1. Define a valid network structure. The structure can have many layers and many edges. The initial weights will be zero.

  2. Increment the probability of each training example by incrementing some weights.

  3. Only the edges with non-zero weights will be retained in the final network.

Finding the weights to increment is also fast. It is similar to evaluating the network. The most active edges have their weights incremented.
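A minimal sketch of one such update, in the spirit of hard EM (my reading of the idea, not the paper's exact procedure): descend from the root, follow the most active child of each sum node, and increment a count on the winning edge. Each Sum node is assumed to carry a counts list initialized to zeros:

def increment_winning_edges(node, inputs):
    """Increments the count of the most active edge under every sum node
    on the path from this node down to the leaves."""
    if isinstance(node, Indicator):
        return
    if isinstance(node, Sum):
        activities = [w * child.eval(inputs)
                      for w, child in zip(node.weights, node.children)]
        best = activities.index(max(activities))
        node.counts[best] += 1  # counts: an assumed per-Sum list of zeros
        increment_winning_edges(node.children[best], inputs)
    else:  # A product node passes the example to all its children.
        for child in node.children:
            increment_winning_edges(child, inputs)

def renormalize_weights(node):
    """Turns the collected counts back into weights that sum to 1."""
    if isinstance(node, Indicator):
        return
    if isinstance(node, Sum):
        total = sum(node.counts)
        if total > 0:
            node.weights = [count / total for count in node.counts]
    for child in node.children:
        renormalize_weights(child)

A real implementation would also have to break ties somehow while all the weights are still zero.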

The details can be seen in the SPN source code. The learning stops at a local maximum. Ask me for an explanation if you are interested.