I will describe how different AI problems are related, presenting them from the most general to the most specific. Each more specific problem also inherits the assumptions from the previous simplifications.
The problems:
1) General Intelligence
The goal is to choose an action to maximize the future total reward:
best_y(observations) = argmax_action future_total_reward(observations, action)
This problem is considered by AIXI.
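As a toy illustration (not AIXI itself, which uses an incomputable Bayesian mixture over environments), the decision rule can be read as a brute-force search over actions with some model of the future total reward. In the sketch below, the action set and the predict_future_total_reward model are hypothetical placeholders.

# Toy sketch: pick the action with the highest predicted future total reward.
# predict_future_total_reward is a hypothetical stand-in for an environment model.
def best_action(observations, actions, predict_future_total_reward):
    """Return the action maximizing the predicted future total reward."""
    return max(actions, key=lambda a: predict_future_total_reward(observations, a))

# Hypothetical usage:
# best = best_action(history, ["left", "right"], my_reward_model)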
2) Reinforcement Learning
Assumptions:
- The environment is stationary. I.e., P(Trajectory) is a fixed probability distribution.
Implications:
- We can talk about the expected value of a function with respect to the probability distribution.
- A fixed policy is enough for a fixed environment.
best_policy = argmax_policy E[total_reward(Trajectory) | policy]
best_y(observations) = a draw from best_policy(Y|observations)
Note that we do not need to compute the expected total reward itself; its gradient is enough, as illustrated in the sketch below.
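To make the remark about the gradient concrete, here is a minimal REINFORCE-style sketch in NumPy. The environment interface (reset() and step()) and the linear-softmax policy are assumptions made for the example; it shows one standard policy-gradient estimator, not the only option.

import numpy as np

# Minimal REINFORCE-style sketch: the expected total reward is never computed;
# only a sampled estimate of its gradient with respect to the policy parameters is used.
# The env object with reset()/step() is an assumed interface, not a real library.

def softmax(z):
    z = z - z.max()
    p = np.exp(z)
    return p / p.sum()

def run_episode(env, theta, rng):
    """Collect one trajectory with a linear-softmax policy; return log-prob gradients and rewards."""
    obs = env.reset()
    grads, rewards, done = [], [], False
    while not done:
        probs = softmax(theta @ obs)                  # theta has shape (n_actions, obs_dim)
        a = rng.choice(len(probs), p=probs)
        one_hot = np.zeros(len(probs))
        one_hot[a] = 1.0
        grads.append(np.outer(one_hot - probs, obs))  # grad of log pi(a|obs) for this policy
        obs, r, done = env.step(a)
        rewards.append(r)
    return grads, rewards

def reinforce_update(theta, grads, rewards, lr=0.01):
    """Take one ascent step along the sampled gradient of E[total_reward | policy]."""
    total = sum(rewards)
    for g in grads:
        theta += lr * total * g                       # simplest estimator, no baseline
    return theta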
3) Contextual Bandits
Assumptions:
- An action does not affect future observations. The sequence of seen contexts is fixed in advance.
- The reward depends only on the used action and the context.
Implications:
- There is no delayed future reward; only the immediate reward is caused by the action. There is no confusion about what caused it, which simplifies training.
- A stochastic policy is not needed. It cannot help us avoid getting stuck, because we cannot affect the observations. The chosen decision can be deterministic, as in the sketch below.
best_y = argmax_y E[total_reward(Contexts, y)]
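A sketch of the implication that the decision can be deterministic: estimate the expected reward of each action in the given context and take the argmax. The linear reward model per action is an assumption made for the example; note that only the reward of the chosen action is observed, so only that action's estimate is updated.

import numpy as np

# Contextual-bandit sketch with a deterministic (greedy) decision rule.
# One linear reward estimator per action; this model choice is an assumption of the example.

class GreedyLinearBandit:
    def __init__(self, n_actions, context_dim, lr=0.1):
        self.weights = np.zeros((n_actions, context_dim))
        self.lr = lr

    def choose(self, context):
        """Deterministic choice: argmax of the estimated rewards for this context."""
        return int(np.argmax(self.weights @ context))

    def update(self, context, action, reward):
        """Only the chosen action's reward is observed, so only its estimate moves."""
        error = reward - self.weights[action] @ context
        self.weights[action] += self.lr * error * context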
4) Supervised Regression
The agent is presented with (x, target) examples.
Assumptions:
- The seen examples are independent: the probability of an example does not depend on the previously seen examples.
- The examples are identically distributed: they share a common distribution P(X, Target).
- The reward function is known: after seeing a target, the reward of every possible prediction can be computed.
Implications:
- The maximum of the expected total reward is at the same point as the maximum of the expected reward from a single example.
- No exploration is needed. We cannot affect future observations, and we can compare the rewards of all possible predictions.
best_y = argmax_y E[reward(Target, y(X))]
The distribution P(X,Target) is still unknown.
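Because P(X, Target) is unknown, the expectation is usually replaced by an average over the seen (x, target) examples. A minimal sketch under the simplifying assumption that there is a finite list of candidate predictors to compare; in practice the candidates form a parametric family searched by gradient methods.

# Empirical version of best_y = argmax_y E[reward(Target, y(X))]:
# the unknown expectation is approximated by the average reward on the seen examples.
def best_predictor(candidates, examples, reward):
    """Pick the candidate predictor with the highest average reward on (x, target) pairs."""
    def avg_reward(y):
        return sum(reward(target, y(x)) for x, target in examples) / len(examples)
    return max(candidates, key=avg_reward)

# Hypothetical usage with a squared-error reward:
# reward = lambda target, prediction: -(target - prediction) ** 2
# best = best_predictor([lambda x: 0.0, lambda x: x, lambda x: 2 * x], data, reward)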
5) Squared Loss
Assumptions:
- The reward function is known to be:
reward(target, y(x)) = - constant * (target - y(x))**2
Implications:
- We can find the argmax_y explicitly. The derivative of the expected reward with respect to y is linear in y, and setting it to zero leads to the following solution:
best_y(x) = E[Target|x]
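This can be checked numerically: for squared loss, the constant prediction that maximizes the average reward over targets drawn at a fixed x is their mean. The sample data below are made up for the check.

import numpy as np

# Numerical check that the squared-loss optimum is the conditional mean E[Target | x]:
# among candidate constant predictions for one fixed x, the sample mean of the targets
# gives the highest average reward -constant * (target - y)**2.
rng = np.random.default_rng(0)
targets_at_x = rng.normal(loc=3.0, scale=1.0, size=10_000)  # draws of Target given one fixed x

def avg_reward(y, targets, constant=1.0):
    return np.mean(-constant * (targets - y) ** 2)

candidates = np.linspace(0.0, 6.0, 601)
best = candidates[np.argmax([avg_reward(y, targets_at_x) for y in candidates])]
print(best, targets_at_x.mean())  # the two values should be close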
Other simplifications could go in different directions. For example, planning assumes a known deterministic environment.
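As a sketch of that planning direction: with a known deterministic model, whole action sequences can be searched before acting. The step function and the action set below are assumptions of the example, and the brute-force enumeration is only workable for short horizons.

from itertools import product

# Brute-force planning sketch for a known deterministic environment:
# enumerate all action sequences of a fixed horizon and keep the one with the best total reward.
def plan(state, actions, step, horizon):
    """Return the best action sequence; step(state, action) -> (next_state, reward) is assumed known."""
    def total_reward(sequence):
        s, total = state, 0.0
        for a in sequence:
            s, r = step(s, a)
            total += r
        return total
    return max(product(actions, repeat=horizon), key=total_reward)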
Used Offline Resources
The supervised learning problem is described in chapter 1 of Bishop's Pattern Recognition and Machine Learning.