Lessoned Code

Results from Sum-Product Networks

2011-12-30T19:49:00.000+01:00

I'm now experimenting with sum-product networks. I will show you some obtained results.

Predicting a half of the image

The above picture shows:

The original images in the first row.
The visible part of the input in the second row.
The expected pixel intensities predicted by the network.

The shown images are from a validation set. The sum-product network was trained on different images. The training was done on 800 images from the notMNIST dataset.

The trained sum-product network is a probabilistic model. It can compute the probability of something, when given some evidence. Here, the network is asked to compute the expected intensity of each pixel. The given evidence is the right half of the image.

Some of the expected pixel intensities look bad. Some of them look good. If multiple values are possible for a pixel, the expected intensity will be a weighted average of the possibilities.

Predicting every second column

The expected pixel intensities look better when giving every second column as the evidence. The number of pixels to predict is still one half of the image.

The better result can be explained by:

The set of all probable whole images is better pruned, when showing pixels from all parts of the image. The expected pixel intensity will be an average from fewer possibilities.
The used network structure knows about locality of pixels. Nearby image regions are connected at the bottom layers of the network.

The whole network gives high probability to some patterns. And it gives lower probability to other patterns. The top sum node has multiple possible patterns as children. The children of a sum node are product nodes. The product nodes split the pattern to smaller sub-patterns.

If two nearby sub-patterns occur together, it is easy to connect the sub-patterns by a product node. The new pattern means: sub_pattern1 AND sub_pattern2. The new pattern will occur often, if the sub-patterns occur often together. Such patterns are discovered when training the network.

The Big Picture

2011-10-24T10:59:00.001+02:00

It is helpful to view all machine-learning methods as approximations of Bayesian inference. This view makes it possible to devise new approximations or to make existing approximations more precise.

The following two videos show the unified view. They explain it better than I could.

Note that the goal is to minimize the expected loss. The expectation is over all possible examples. Modeling P(X, Y) can help to have small loss on unseen examples.

Footnote:
Even SVM can be viewed as a probabilistic model.

Intro to Sum-Product Networks

2011-10-02T15:19:00.000+02:00

In 2011, a new probabilistic model was proposed. It is fast. It can represent many probability functions. And its learning prefers simple explanations. I will show you what the model looks like. You can then read the original paper for details: Sum-Product Networks: A New Deep Architecture.

Computing Probabilities Quickly

Sum-product networks make it possible to compute the probability of an event quickly. They were first invented as a data structure for this quick computation.

Network with one variable

A sum-product network is a directed acyclic graph. It has alternating layers of sum and product nodes. Only the edges below a sum node have weights. The weights are normalized to sum to 1.

To compute the probability of some evidence, the network is evaluated from the bottom up.

P(X1=1) = net.eval_with(x1=1, not_x1=0)
    = P(X1=1) * 1 + P(X1=0) * 0

Network with two independent variables

When two variables are independent, their joint probability can be represented concisely. We don't need to store the probabilities of all their combinations. It is enough to store the factors: P(X1=1), P(X2=1).

The joint probability is calculated by evaluating the network:

P(X1=1, X2=1) = net.eval_with(
        x1=1, not_x1=0,
        x2=1, not_x2=0)
    = (P(X1=1) * 1 + P(X1=0) * 0) * (P(X2=1) * 1 + P(X2=0) * 0)
    = P(X1=1) * P(X2=1)

It is also possible to calculate the probability of a subset of variables. Naively, we would evaluate the network multiple times:

P(X1=1) = P(X1=1, X2=0) + P(X1=1, X2=1)

It can be done faster. We want the network to sum all the possible branches. We can do that by setting both x2 and not_x2 to 1. It is then enough to evaluate the network just once.

P(X1=1) = net.eval_with(
        x1=1, not_x1=0,
        x2=1, not_x2=1)
    = (P(X1=1) * 1 + P(X1=0) * 0) * (P(X2=1) * 1 + P(X2=0) * 1)
    = P(X1=1) * 1.0

The computation is done in O(num_edges) steps.
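The evaluation above can be sketched in a few lines of Python. This is a minimal sketch, not the code from the paper: the tuple-based node representation, the `evaluate` function and the probabilities 0.3 and 0.6 are assumptions made for illustration.

```python
# Hypothetical node representation (an assumption for this sketch):
#   ("leaf", indicator_name)
#   ("product", [child, ...])
#   ("sum", [(weight, child), ...])

def evaluate(node, indicators):
    """Evaluate the network bottom-up for the given indicator values."""
    kind = node[0]
    if kind == "leaf":
        return indicators[node[1]]
    if kind == "product":
        result = 1.0
        for child in node[1]:
            result *= evaluate(child, indicators)
        return result
    # A sum node: weighted sum of its children.
    return sum(weight * evaluate(child, indicators)
               for weight, child in node[1])


# Two independent variables with made-up P(X1=1) = 0.3 and P(X2=1) = 0.6.
x1_sum = ("sum", [(0.3, ("leaf", "x1")), (0.7, ("leaf", "not_x1"))])
x2_sum = ("sum", [(0.6, ("leaf", "x2")), (0.4, ("leaf", "not_x2"))])
net = ("product", [x1_sum, x2_sum])

# P(X1=1, X2=1): all indicators are set according to the evidence.
joint = evaluate(net, {"x1": 1, "not_x1": 0, "x2": 1, "not_x2": 0})

# P(X1=1): X2 is summed out by setting both of its indicators to 1.
marginal = evaluate(net, {"x1": 1, "not_x1": 0, "x2": 1, "not_x2": 1})

print(joint)
print(marginal)
```

This recursive version touches each edge once only when the network is a tree. A network with shared nodes would need to cache the value of each node to stay within O(num_edges) steps.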

Network with conditional independence

Conditional independence also helps to make the network concise. In the above network, variables X1 and X2 are independent when given X3. The value of X3 switches between the branches of the network.

P(X1=1, X2=1, X3=1) = net.eval_with(
        x1=1, not_x1=0,
        x2=1, not_x2=0,
        x3=1, not_x3=0)
    = (P(X1=1|X3=1) * P(X2=1|X3=1) * 1) * P(X3=1)

This network can represent the same probabilities as a Naive Bayes model.
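As a rough sketch, the network with conditional independence can be written as one sum over the two values of X3. The function name `eval_naive_bayes`, the `params` dictionary and all the numbers are assumptions chosen for illustration.

```python
def eval_naive_bayes(params, x1, not_x1, x2, not_x2, x3, not_x3):
    """Evaluate a network where X1 and X2 are independent given X3."""
    total = 0.0
    # The top sum node has one branch per value of X3.
    for x3_value, x3_indicator in ((1, x3), (0, not_x3)):
        p_x1 = params["p_x1_given_x3"][x3_value]
        p_x2 = params["p_x2_given_x3"][x3_value]
        p_x3 = params["p_x3"] if x3_value == 1 else 1 - params["p_x3"]
        # Product node: the two conditional factors and the X3 indicator.
        branch = ((p_x1 * x1 + (1 - p_x1) * not_x1)
                  * (p_x2 * x2 + (1 - p_x2) * not_x2)
                  * x3_indicator)
        total += branch * p_x3
    return total


# Made-up parameters: P(X3=1), P(X1=1|X3) and P(X2=1|X3).
params = {
    "p_x3": 0.5,
    "p_x1_given_x3": {1: 0.9, 0: 0.2},
    "p_x2_given_x3": {1: 0.8, 0: 0.1},
}

# P(X1=1, X2=1, X3=1): the X3=0 branch is multiplied by not_x3 = 0.
joint = eval_naive_bayes(params, 1, 0, 1, 0, 1, 0)

# P(X1=1): X2 and X3 are summed out by setting their indicators to 1.
marginal_x1 = eval_naive_bayes(params, 1, 0, 1, 1, 1, 1)

print(joint)
print(marginal_x1)
```

With x3=1 and not_x3=0, the branch for X3=0 contributes nothing, so only the expression shown above remains.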

Expressive Power

Any Bayesian network can be converted to a sum-product network. The resulting sum-product network may have many edges. In the worst case, the number of edges in the sum-product network will be proportional to the time complexity of computing probabilities in the Bayesian network (i.e., O(num_vars * 2**treewidth)).

Some probability functions can be represented more concisely by a sum-product network. For example, the sum-product network can omit edges with zero weights. And a sum-product network can reuse nodes. The network does not need to be a tree. A node can be reused by multiple parents.

The concise sum-product network can be learned directly from training examples.

Learning

Learning is done by maximizing likelihood P(training_examples|net). The learned network should give high probability to each seen example.

Requirements for network structure

At the start, we don't know the network structure and the edge weights. We have some requirements for the network structure:

The number of edges should be small. It forces the network to find general explanations for the seen examples. That should improve generalization. The network will also have a small number of weights. A small number of examples will be enough to estimate the weights.

The small number of edges will also make the network evaluation fast.
The network should produce valid probabilities. The probability of an event should be equal to the sum of the probabilities of all the included outcomes.

Fortunately, the validity of the produced probabilities can be ensured by two simple constraints:

All children of a sum node must use the same variables. No variable can be missing and no variable can be appended. Only a product node can split the set of used variables to subsets.
A product node must not multiply xi with not_xi.
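The two constraints can be checked mechanically. The following is a minimal sketch under assumptions of my own: nodes are tuples, and each indicator leaf is named either "xi" or "not_xi". It is not taken from the SPN source code.

```python
# Hypothetical node representation (an assumption for this sketch):
#   ("leaf", indicator_name)
#   ("product", [child, ...])
#   ("sum", [(weight, child), ...])

def children_of(node):
    if node[0] == "sum":
        return [child for _, child in node[1]]
    return list(node[1])


def literals(node):
    """Return the set of indicator names used below a node."""
    if node[0] == "leaf":
        return {node[1]}
    result = set()
    for child in children_of(node):
        result |= literals(child)
    return result


def variable(literal):
    return literal[4:] if literal.startswith("not_") else literal


def negate(literal):
    return literal[4:] if literal.startswith("not_") else "not_" + literal


def is_valid(node):
    if node[0] == "leaf":
        return True
    children = children_of(node)
    if not all(is_valid(child) for child in children):
        return False
    child_literals = [literals(child) for child in children]
    if node[0] == "sum":
        # All children of a sum node must use the same variables.
        scopes = [{variable(lit) for lit in lits} for lits in child_literals]
        return all(scope == scopes[0] for scope in scopes)
    # A product node must not multiply xi with not_xi.
    for i, lits in enumerate(child_literals):
        for j, other in enumerate(child_literals):
            if i != j and any(negate(lit) in other for lit in lits):
                return False
    return True


good = ("product", [
    ("sum", [(0.3, ("leaf", "x1")), (0.7, ("leaf", "not_x1"))]),
    ("sum", [(0.6, ("leaf", "x2")), (0.4, ("leaf", "not_x2"))]),
])
bad_sum = ("sum", [(0.5, ("leaf", "x1")), (0.5, ("leaf", "x2"))])
bad_product = ("product", [("leaf", "x1"), ("leaf", "not_x1")])

print(is_valid(good), is_valid(bad_sum), is_valid(bad_product))
```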

Learning steps

A valid network can be learned by:

Define a valid structure of the network. The structure can have many layers and many edges. The initial weights will be zero.
Increment the probability of each training example, by incrementing some weights.
Only the edges with non-zero weights will be retained in the final network.

Finding the weights to increment is also fast. It is similar to evaluating the network. The most active edges should have their weights incremented.

The details can be seen in the SPN source code. The learning stops at a local maximum. Ask me for an explanation, if you are interested.

Decision Trees without Pruning

2011-09-07T19:15:00.000+02:00

Do you want to improve generalization of a decision tree? Do no pruning. Try full Bayesian weighting instead. A prediction can still be done in O(tree_depth) steps.

Pruning is an approximation of Bayesian weighting. It gives weight 1.0 to one short tree. And it gives weights 0 to all other trees.

It can be done better. We can give weights to all tree depths. Predictions from shorter trees would get bigger weights. Context Tree Weighting allows that.

Context

Context Tree Weighting is normally used to predict the next bit of a string. For that, it uses the suffix of the string as a context. Different predictions are done from different context lengths. The final prediction is a weighted sum of the predictions. The shorter context lengths get bigger weights.

The context does not need to be a suffix. We can use any list of bits as a context.

context = get_context(model, training_example)

Training

First, a fully grown binary decision tree would be constructed. Information gain or some other greedy heuristic can be used for that.

The decision tree can be now used to select contexts for the Context Tree Weighting model. Different decision tree depths should serve as different context lengths. We would use the path from the tree root to a tree leaf as the context.

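def get_context(model, training_example):
    # Walk from the root to a leaf and record the branch taken at each node.
    path = []
    node = model.tree.root
    # A missing child (None) marks the end of the path.
    while node is not None:
        child_index = node.choose_child(training_example)
        path.append(child_index)
        node = node.children[child_index]

    # Reverse so that the decision made at the root ends up on the right.
    path.reverse()
    return path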

The path is reversed to have the most important bit on the right. In a suffix the most important bit is also on the right.
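To make the path extraction concrete, here is a small runnable sketch. The `Node` class, its `feature` attribute and the toy tree are hypothetical. The only conventions taken from the function above are `choose_child`, `children`, `model.tree.root` and the use of None for a missing child.

```python
from types import SimpleNamespace


class Node:
    """Hypothetical binary decision node that tests one bit of the example."""

    def __init__(self, feature, children=(None, None)):
        self.feature = feature
        self.children = list(children)

    def choose_child(self, example):
        # The tested bit selects the child: 0 or 1.
        return example[self.feature]


def get_context(model, training_example):
    path = []
    node = model.tree.root
    while node is not None:
        child_index = node.choose_child(training_example)
        path.append(child_index)
        node = node.children[child_index]
    path.reverse()
    return path


# The root tests bit 0. Its 0-branch tests bit 1. Its 1-branch ends at once.
root = Node(0, (Node(1), None))
model = SimpleNamespace(tree=SimpleNamespace(root=root))

print(get_context(model, [0, 1]))
print(get_context(model, [1, 0]))
```

For the example [0, 1], the decisions along the path are 0 and then 1, so the reversed context is [1, 0]. The decision made at the root ends up on the right.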

Each training example would be used to train the Context Tree Weighting model. The training precomputes values in the model. The precomputed values can be stored in the decision tree nodes. Each example is processed in O(tree_depth) steps.

Prediction

The decision tree is used as a normal Context Tree Weighting model. A context of the test example is obtained. And a prediction is made based on the context.

Resources

The possibility of non-suffix contexts was mentioned in chapter 6 of Approximate Universal Artificial Intelligence.

The weighting of different decision tree depths is simple. I guess somebody else has suggested it before, but I haven't found a paper mentioning it. So I share the trick here.

To the edge of our galaxy

2011-08-07T15:35:00.001+02:00

It is possible to travel 500+ light years in 80 years. And you don't need to travel faster than light. Stephen Hawking mentioned the possibility at the end of his article.

Here's how

Use a giant rocket. It needs to carry rocket fuel for 6 years of accelerating. After the 6 years, you would travel at 99% of the speed of light. You would be moving very fast in space. You would be moving less in time. Your clock would go slower than clocks on Earth.

A Bayesian Sequence Predictor

2011-06-26T16:35:00.006+02:00

It is possible to use Bayesian inference for sequence prediction. I will show how to predict a continuation of a binary sequence.

For example, we want to know what the next bit will be after seeing "01101".

Considered Generators

I will not consider all possible sequence generators. All possible sequence generators are all possible computer programs. I will consider a smaller set of generators. I will consider all generators, where the next generated bit depends only on a sequence suffix.

Each suffix will assign a different probability to the next bit:

import random

class OnSuffixGenerator:
    def __init__(self, suffix):
        self.suffix = suffix
        self.next_bit_probability = unknown_function(suffix)

    def generate_next_bit(self, seq):
        assert seq.endswith(self.suffix)
        probability = self.next_bit_probability
        return "1" if probability > random.random() else "0"

Multiple OnSuffixGenerators can be used to generate a sequence. They cooperate to cover different suffixes.

We don't know the used OnSuffixGenerators. We see only the beginning of a generated sequence. And we assume that it was generated by a suffix-based generator.

Brute Force Approach

If we want to predict the next bit of the sequence, we want to know the probability of the next bit being one:

P(NextBit=1|seq) = P(seq + "1") / P(seq)

For that, we need to calculate P(seq + "1") and P(seq). The probability of a sequence is its marginal probability:

P(seq) = sum(P(seq|generator) * P(generator) for
        generator in ALL_GENERATORS)

A brute force approach would calculate the expression directly. The number of all possible suffix-based generators is very big. Each generator is a cooperation of one or more OnSuffixGenerators.

For example, OnSuffixGenerators for "00", "10" and "1" suffixes can cooperate as:

def get_next_bit_generator(seq):
    if seq.endswith("00"):
        return OnSuffixGenerator("00")
    if seq.endswith("10"):
        return OnSuffixGenerator("10")
    if seq.endswith("1"):
        return OnSuffixGenerator("1")

The number of all possible suffix-based generators is bigger than 2**len(seq). The brute force approach would need to iterate over all of them.

Context-Tree Weighting Approach

Fortunately, we can calculate the sequence probability in a faster way. We can consider one OnSuffixGenerator in many cooperations at the same time.

Weighting

If we have only two possible generators, the probability of a sequence is:

P(seq) = P(seq|generator1) * P(generator1) +
         P(seq|generator2) * P(generator2)

We will give them equal prior probabilities. You can incorporate a different prior belief, if you have it.

P(seq) = P(seq|generator1) * 0.5 +
         P(seq|generator2) * 0.5

Notation

We now need to express P(seq|generator1), when considering our possible sequence generators. A considered OnSuffixGenerator is generating only bits after the given suffix. A different generator is used to generate the remaining bits. I will need a notation to express the probability of just bits after the given suffix. I will use P(seq on s) syntax for the probability that the bits after the suffix s were generated by an OnSuffixGenerator(s).

P(seq on s) = P(seq|
        bits_with_different_suffix(seq, s),
        Generator=OnSuffixGenerator(s))
P(seq on not_present_suffix) = P(seq|seq) = 1

I will use *s syntax to denote all suffixes ending with the suffix s.

Putting it together

Let's consider the probability of a sequence. For the bits after a suffix, we have only two possible generators:

The bits can be generated by a generator with the given suffix.
Or the bits can be generated by a cooperation of different generators with some longer suffixes.

P(seq on *s) = P(seq on s) * 0.5 +
        P(seq on *0s) * P(seq on *1s) * P(seq on STARTs) * 0.5

The "STARTs" suffix is matching only at the start of the sequence. It is used to cover the bits uncovered by the longer suffixes.

The probability of the whole sequence is the probability of all its bits:

P(seq) = P(seq on *)

The obtained sequence predictor isn't just theoretically nice. Context-Tree Weighting (CTW) variants are doing well on text compression benchmarks.

Implementation

The recursive P(seq on *s) can reuse previous computations. It can be then computed in O(len(seq)) steps. The computation starts at the biggest depth and carries the partial results to the front.

I implemented the algorithm in Python. You can play with continuation of binary sequences.
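The following is not that implementation. It is a minimal sketch of the weighting recursion under three assumptions that the text above does not make: the context depth is bounded by `depth`, the sequence is preceded by `depth` zero bits instead of using a START suffix, and the probability of the bits after one suffix is given by the Krichevsky-Trofimov estimator. It recomputes everything from scratch, so it does not reach O(len(seq)) steps per prediction.

```python
def kt_probability(zeros, ones):
    """Krichevsky-Trofimov probability of a bit string with these counts."""
    probability = 1.0
    for i in range(zeros):
        probability *= (i + 0.5) / (i + 1)
    for j in range(ones):
        probability *= (j + 0.5) / (zeros + j + 1)
    return probability


def ctw_probability(seq, depth=3):
    """Weighted probability of seq, mixing all suffix lengths up to depth."""
    padded = "0" * depth + seq  # assumed past: depth zero bits
    # counts[suffix] = [zeros, ones] seen right after that suffix
    counts = {}
    for i in range(depth, len(padded)):
        bit = int(padded[i])
        for k in range(depth + 1):
            suffix = padded[i - k:i]
            counts.setdefault(suffix, [0, 0])[bit] += 1

    def weighted(suffix):
        if suffix not in counts:
            return 1.0  # no bits follow this suffix
        estimate = kt_probability(*counts[suffix])
        if len(suffix) == depth:
            return estimate
        # Either this suffix explains its bits, or the longer suffixes do.
        longer = weighted("0" + suffix) * weighted("1" + suffix)
        return 0.5 * estimate + 0.5 * longer

    return weighted("")


def predict_one(seq, depth=3):
    """P(NextBit=1|seq) = P(seq + "1") / P(seq)."""
    return ctw_probability(seq + "1", depth) / ctw_probability(seq, depth)


print(predict_one("01101"))
```

The probabilities of the two possible continuations add up to the probability of the sequence itself, so `predict_one` returns a valid probability.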

Resources

The Context-Tree Weighting Method: Basic Properties: This paper introduced the CTW method.
Reflections on "The Context-Tree Weighting Method: Basic Properties": It complements the original paper in explanation.
A Monte-Carlo AIXI Approximation: Joel Veness's C++ implementation of CTW shows clearly, how to implement the algorithm efficiently.
Approximate Universal Artificial Intelligence: Chapter 3 in Joel Veness's PhD thesis explains how the used P(generator) prior is related to the generator complexity. Interesting future ideas are proposed in chapter 6.

You don't see everything

2010-08-14T11:39:00.000+02:00

Your perception is suppressed before a movement of your eyes. You cannot see your eyes moving when looking into a mirror. Look at the left eye, look at the right eye. Only an external observer will see the movement of your eyes.

Your perception is resumed when the image stops being blurred. That is, when the velocity of the eye matches the relative velocity of the object. That also allows you to see non-blurred details outside of a running train window.

Problem Simplification

2010-06-30T18:53:00.004+02:00

I will mention how different AI problems are related. I present them from the most general problems to the most specific problems. Each more specific problem also includes the assumptions from the previous simplifications.

The problems:

1) General Intelligence

The goal is to choose an action to maximize the future total reward:

best_y(observations) = argmax_action future_total_reward(
    observations, action)

This problem is considered by AIXI.

2) Reinforcement Learning

Assumptions:

The environment is stationary. I.e., P(Trajectory) is a fixed probability distribution.

Implications:

We can talk about the expected value of a function with respect to the probability distribution.
A fixed policy is enough for the fixed environment.

best_policy = argmax_policy E[
    total_reward(Trajectory)|policy]

best_y(observations) = a draw from best_policy(Y|observations)

Note that we do not need to compute the expected total reward itself. Its gradient is enough.

3) Contextual Bandits

Assumptions:

An action does not affect the future observations. The sequence of seen contexts is already assigned.
The reward depends only on the used action and the context.

Implications:

There is no delayed future reward. Only the immediate reward is caused by the action. There is no confusion what caused it. That simplifies training.
A stochastic policy is not needed. It cannot help us avoid getting stuck, because we cannot affect the observations. The chosen decision can be deterministic.

best_y = argmax_y E[total_reward(Contexts, y)]

4) Supervised Regression

The agent is presented with (x, target) examples.

Assumptions:

The seen examples are independent. Their probability does not depend on the already seen examples.
The examples are identically distributed. They share a P(Target, X) probability distribution.
The reward function is known. It is possible to compute all possible rewards after seeing a target.

Implications:

The maximum of the expected total reward is at the same point as the maximum of the expected reward from a single example.
No exploration is needed. We cannot affect the future observations. And we can compare all possible rewards.

best_y = argmax_y E[reward(Target, y(X))]

The distribution P(X,Target) is still unknown.

5) Squared Loss

Assumptions:

The reward function is known to be:

reward(target, y(x)) = - constant * (target - y(x))**2

Implications:

We can find the argmax_y explicitly. The derivative of the expected reward is a linear function of y. It leads to the following solution:

best_y(x) = E[Target|x]
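The step from the reward function to this solution can be written out. In this sketch, T stands for Target, y for the predicted value y(x), and c for the constant, which is assumed to be positive:

```latex
\begin{align*}
\frac{\partial}{\partial y}\,
  \mathbb{E}\!\left[-c\,(T - y)^2 \mid x\right]
  &= \mathbb{E}\!\left[2c\,(T - y) \mid x\right] \\
  &= 2c\left(\mathbb{E}[T \mid x] - y\right) = 0 \\
\Rightarrow\quad y &= \mathbb{E}[T \mid x]
\end{align*}
```

The second derivative is -2c, which is negative for a positive c, so the point is a maximum of the expected reward.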

Other simplifications could go in different directions. For example, planning assumes a known deterministic environment.

Used Offline Resources

The supervised learning problem is described in chapter 1 of Bishop's Pattern Recognition and Machine Learning.

Statistical Analysis Community

2010-06-24T11:31:00.005+02:00

Do you want to teach yourself machine-learning? The following links may guide you:

The above articles were written by the brain behind FlightCaster. That gives them some credibility.

There is also a proposal to create a Q&A site for statistics, data analysis, data mining and data visualization. Join the community if you plan to ask and answer statistical questions. We need more people to start the beta phase. Programmers already have their Q&A site at Stack Overflow.

Common steps in machine learning

2010-05-30T12:33:00.021+02:00

I will summarize one way to solve a machine learning problem. These abstract steps fit many problems.

Understand the task. See how to measure the performance.

A computer program is learning if its performance improves with more experience. We are going to design such a program.
Choose the source of training experience.
Decide what will be input and output.
Choose a set of models to approximate the output function.
For example, the set could be a set of linear functions or a set of neural networks with N hidden neurons. We should use our knowledge of the problem to restrict the set. The set only needs to contain the true output function or its good approximation.
Choose a learning algorithm.
The algorithm will select one model from the given set of models. The model is selected based on high probability under the given training data (i.e., P(model|data)).

We may also select multiple models and use a weighted vote from them. The models should be mutually exclusive. Every model should have just one vote.

Resources

Chapters 1, 2 and 6 of Mitchell's Machine Learning.

Viewing programming as constraint satisfaction

2010-04-10T15:14:00.005+02:00

Programming could be viewed as solving a constraint satisfaction problem. We start with some empty disk space and the goal is to have a program there. Only the goal state is important. We are not searching for a path. We are searching for a goal state that meets the constraints.

I could view my programming as an attempt to solve the problem efficiently. Modifying existing source code is like a mutation in a local search. Hitting an unexpected constraint and choosing a different path is like backtracking. And rewriting allows me to get rid of old constraints.

A super intelligent machine may laugh. It will see the inefficiency of my primitive attempts. I would seem like a 19-month-old baby.

Unlimited "cd -" history

2010-04-02T22:04:00.007+02:00

I use bash and I use "cd -" to go to the previous directory. To be able to go back more than one step, I had to replace the original cd command with a function:

# cd with automatic pushd
function cd() {
    if test "x$1" = "x-" ; then
        popd >/dev/null
    else
        pushd . >/dev/null
        builtin cd "$@"
    fi
}

Put that into your ~/.bashrc and start a new bash.

Example usage

/home$ cd /opt
/opt$ cd /var/log
/var/log$ cd -
/opt$ cd -
/home$

Learning from history

2010-03-06T09:07:00.020+01:00

It is amazing to see that learning from historical data is theoretically solved. For example, we could calculate the probability that all crows are black when N black crows were seen previously:

P(all_crows_are_black|seen_N_black_crows)
    = P(seen_N_black_crows|all_crows_are_black) * P(all_crows_are_black)
      * 1/P(seen_N_black_crows)
    = 1 * P(all_crows_are_black) * 1/P(seen_N_black_crows)

A different model could predict that 90% of crows are black. Its probability after seeing N black crows would be:

P(90%_of_crows_are_black|seen_N_black_crows)
    = 0.9**N * P(90%_of_crows_are_black) * 1/P(seen_N_black_crows)

The 1/P(seen_N_black_crows) constant is not known. We could interpret it as a normalization constant. It ensures that the sum of probabilities of all possible models is 1. Or we could ignore it if just comparing the probabilities of different models.

Many models could have non-zero probability when given a small history. We should use them all when making a prediction. A prediction is just the probability of unseen data based on the seen data. That is calculated by:

P(data|old_data) 
    = sum(P(data|h,old_data) * P(h|old_data) for h in ALL_MODELS)

This approach is completely general. It could be used for non-independent samples, time series, everything. We would then work with models that predict such non-independent data or time series.
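As a small numeric illustration, the crow example can be computed directly. The sketch assumes that only the two models above exist and that both have a prior of 0.5. The priors, the value N = 10 and the names are assumptions, not data.

```python
def posterior(priors, likelihoods):
    """Bayes' rule over a finite set of models."""
    unnormalized = {name: priors[name] * likelihoods[name] for name in priors}
    # With only these models, the normalizer is P(seen_N_black_crows).
    normalizer = sum(unnormalized.values())
    return {name: value / normalizer for name, value in unnormalized.items()}


N = 10  # number of black crows seen so far (made-up)
priors = {"all_black": 0.5, "ninety_percent_black": 0.5}
likelihoods = {"all_black": 1.0 ** N, "ninety_percent_black": 0.9 ** N}

post = posterior(priors, likelihoods)

# Prediction: probability that the next crow is black, using all models.
p_black = {"all_black": 1.0, "ninety_percent_black": 0.9}
p_next_black = sum(p_black[h] * post[h] for h in post)

print(post)
print(p_next_black)
```

After ten black crows, the "all crows are black" model has the higher posterior. The other model still contributes to the prediction.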

Additional Resources

AI: A Modern Approach chapters 13, 14 and 20 give an introduction to probabilities and Bayesian learning.
On Universal Prediction and Bayesian Confirmation by Marcus Hutter. It hints how to estimate the P(model|no_data_yet) probabilities. Simpler models are preferred.

Many layers are needed

2010-01-23T13:06:00.002+01:00

I have read an interesting paper on limitations of machine learning models: Scaling Learning Algorithms towards AI. It mentions limitations of two-layer neural networks and other two-layer models (SVMs). These shallow models are unable to learn some functions without an exponential number of components. For example, to learn the parity function over N input bits, they would need 2^N hidden neurons.

On the other hand, a deep model with N layers could compute the parity with just N components.
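One way to picture the deep solution is a chain of two-input XOR components, one per layer. This is only an illustrative sketch. The function name `parity_deep` is hypothetical, and the chain is not a trained model.

```python
def parity_deep(bits):
    """Compute parity with one XOR component per input bit."""
    state = 0
    for bit in bits:
        # Each layer combines the running result with one more input bit.
        state = state ^ bit
    return state


print(parity_deep([1, 0, 1, 1]))
```

Each layer only needs the result of the previous layer and one new input, so the number of components grows linearly with N.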

Humane utility function

2010-01-19T20:07:00.008+01:00

It will be hard to design a utility function for a strong AI. The utility function should express what the AI should maximize. Humans still cannot decide what weight to assign to lives. Especially if you have to decide between lives and the restoration of order in a society.

AI for real life: Understanding autonomy

2010-01-02T14:12:00.003+01:00

There are talks about autonomy and other forms of motivations. But I realized the meaning of autonomy only after reading about it in an AI book.

Let's first define how we want an agent or an employee to behave. We want him to try his best to maximize our score assigned to him. When doing "his best", he can only use his prior knowledge or his senses. We should not blame a deaf person for not running in reaction to the sound "fire". We should also not blame a person without the prior knowledge of the English word "fire". They are acting rationally under the given conditions.

And autonomy is the ability to enhance or correct the prior knowledge. An autonomous agent does not need to follow the rules defined in the prior knowledge. He could override them if it makes sense to him. For example, he does not have to run immediately after hearing "fire". He can grab the deaf person's hand first.

Solving Problems by Searching

2009-12-06T22:14:00.010+01:00

If you are new to AI, you will want to read this sample chapter: Solving Problems by Searching.
It explains the breadth-first search, A*, heuristic estimates, ...

It is from the newly updated AI: A Modern Approach 3rd Edition.

AI for real life: The perfect is the enemy of the good

2009-11-30T21:21:00.011+01:00

I have seen that searching for a perfect solution is much harder than searching for a good solution. My experience is based on AI planning. You have to find a path from a start to a goal there. A perfect solution would find the shortest path.

If you need the shortest path, you have to examine all promising paths. You cannot skip a path that could be possibly shorter than the currently best known solution.

If a good solution is enough, you have more ways to find a solution. For example, you could stop the search when no promising path would be X-times better. You could also be more realistic about the promising paths. It is OK to over-estimate the length of a path. That would postpone the path for later examination. A good enough solution could be found in the meantime.

The hardness of the search can be seen in the International Planning Competition 2008. They have two tracks:

The "satisficing track" is for planners searching for a good solution.
The "optimization track" is for cost-optimal planners. They are searching only for the perfect solution.

The competition uses much larger problems for the satisficing planners. The cost-optimal planners would not be able to solve the same problems in the given time.

It is also interesting that no cost-optimal planner was better than a basic breadth-first search. The breadth-first search was used as the "baseline" for the "optimization track". A planner would need prior knowledge, to carefully estimate the lengths of the paths. Otherwise it is not faster than doing the walk.

Additional reading

The prior knowledge could be, for example, knowledge of solutions to easier problems. You then know that the harder problem will take at least the same number of steps: Hierarchical A*: Searching Abstraction Hierarchies Efficiently

Face detection that just works

2009-11-28T15:39:00.009+01:00

I may be the last man who noticed this. My new camera does face detection when looking at persons. And it just works. You only need AI knowledge to appreciate it.

AI for real life: Optimize team utility

2009-11-27T17:49:00.009+01:00

I will write some articles about Artificial Intelligence (AI). Especially, they will be about what I have learned from AI. The first one is about teamwork.

Working in a team is different from working solo. They are two different problems. When you are working solo, your goal is to maximize the amount of work done by you. In AI, they would say that an agent optimizes its utility function. It is like a score you receive from a finished game.

Inside a team, the goal of the problem is different. The goal is to maximize the amount of work done by the team. You should maximize the sum of the work done by you and your coworkers.

It is a much harder problem to optimize the sum of the utilities. If you are choosing an action, you would like to be able to predict its effects. A good start is to create a model of the internal state of your coworker. We humans call that empathy.

To understand AI:
Universal Intelligence: A Definition of Machine Intelligence

To understand other people:
How to Win Friends & Influence People

Python string concatenation performance

2008-07-24T10:49:00.006+02:00

A small test revealed that "".join(listOfStrings) is not faster than plain +=. ~~The .join() is slower.~~ Using .append() and .join() is slower.

Time with .join():

real    0m2.908s

Time with +=:

real    0m1.742s

The test:

#!/usr/bin/env python

def combine(inc, count):
    text = ""
    for i in xrange(count):
        text += inc
    return len(text)

def combineByJoin(inc, count):
    text = []
    for i in xrange(count):
        text.append(inc)
    text = "".join(text)
    return len(text)

def main():
    inc = "a" * 10
    print combine(inc, 10000000)
    #print combineByJoin(inc, 10000000)

main()

Tested on Python 2.5.2.

Stop disk clicking

2008-07-03T08:17:00.004+02:00

If your disk clicks when it is idle, its power management may be set too aggressively. Each click increases Load_Cycle_Count. Check that:

$ sudo smartctl -A /dev/sda | grep Load

To stop it, use hdparm:

$ sudo hdparm -B 254 /dev/sda

See how to force the hdparm setting on boot and resume: https://wiki.ubuntu.com/DanielHahler/Bug59695

Form submit via jQuery

2008-06-28T15:06:00.004+02:00

With jQuery it is easy to implement form submit by Ajax. This is how I do it:

$("#filter").submit(function(event) {
    event.preventDefault();
    event.stopPropagation();

    var params = {};
    $(":input", this).each(function() {
      params[this.name] = this.value;
    });

    $.get("/ajax/filter", params, aCallback);
});

A working example is the filter form on Jaký Byt: Mapa.