Wednesday, June 30, 2010

Problem Simplification

I will outline how different AI problems are related, presenting them from the most general to the most specific. Each more specific problem also inherits the assumptions from the previous simplifications.

The problems:

1) General Intelligence

The goal is to choose an action to maximize the future total reward:

best_y(observations) = argmax_action future_total_reward(
    observations, action)

This problem is considered by AIXI.
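As a rough Python sketch, assuming a hypothetical future_total_reward model that scores each candidate action, the choice could look like this:

def best_y(observations, actions, future_total_reward):
    # Return the action with the highest estimated future total reward.
    return max(actions,
               key=lambda action: future_total_reward(observations, action))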

2) Reinforcement Learning

Assumptions:

  1. The environment is stationary. I.e., P(Trajectory) is a fixed probability distribution.

Implications:

  • We can talk about the expected value of a function with respect to the probability distribution.
  • A fixed policy is enough for a fixed environment.
best_policy = argmax_policy E[
    total_reward(Trajectory)|policy]

best_y(observations) = a draw from best_policy(Y|observations)

Note that it is not necessary to compute the expected total reward itself. Its gradient is enough.
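For illustration, here is a sketch of a score-function (REINFORCE-style) estimate of that gradient in Python; sample_trajectory and grad_log_policy are hypothetical helpers for the environment and the parametric stochastic policy:

import numpy as np

def policy_gradient_estimate(theta, sample_trajectory, grad_log_policy,
                             num_samples=100):
    # Average the score-function estimate over sampled trajectories,
    # without ever computing the expected total reward itself.
    grad = np.zeros_like(theta)
    for _ in range(num_samples):
        trajectory, total_reward = sample_trajectory(theta)
        for observation, action in trajectory:
            grad += grad_log_policy(theta, observation, action) * total_reward
    return grad / num_samples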

3) Contextual Bandits

Assumptions:

  1. An action does not affect the future observations. The sequence of seen contexts is fixed in advance.
  2. The reward depends only on the used action and the context.

Implications:

  • There is no delayed future reward. Only the immediate reward is caused by the action, so there is no confusion about what caused it. That simplifies training.
  • A stochastic policy is not needed. It cannot help us avoid getting stuck, because we cannot affect the observations. The chosen decision can be deterministic.
best_y = argmax_y E[total_reward(Contexts, y)]
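As an illustration, here is a sketch of such a deterministic decision in Python, assuming the expected immediate reward is estimated by averaging logged (context, action, reward) triples; the logged data format is an assumption:

from collections import defaultdict

def estimate_rewards(logged_data):
    # logged_data: iterable of (context, action, reward) triples.
    totals = defaultdict(float)
    counts = defaultdict(int)
    for context, action, reward in logged_data:
        totals[(context, action)] += reward
        counts[(context, action)] += 1
    return {key: totals[key] / counts[key] for key in totals}

def best_y(context, actions, estimated_reward):
    # Pick the action with the highest estimated immediate reward.
    return max(actions,
               key=lambda y: estimated_reward.get((context, y), 0.0))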

4) Supervised Regression

The agent is presented with (x, target) examples.

Assumptions:

  1. The seen examples are independent. The probability of an example does not depend on the previously seen examples.
  2. The examples are identically distributed. They share a P(Target, X) probability distribution.
  3. The reward function is known. It is possible to compute all possible rewards after seeing a target.

Implications:

  • The maximum of the expected total reward is at the same point as the maximum of the expected reward from a single example.
  • No exploration is needed. We cannot affect the future observations, and we can compare all possible rewards after seeing the target.
best_y = argmax_y E[reward(Target, y(X))]

The distribution P(Target, X) is still unknown.
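A possible sketch of maximizing the expected reward from a single example, assuming a known reward function, a hypothetical set of candidate predictions, and a sample of targets observed for the given x:

def best_y_for_x(candidate_ys, observed_targets, reward):
    # Average the known reward over i.i.d. target samples and pick
    # the candidate prediction with the highest average reward.
    def average_reward(y):
        return sum(reward(t, y) for t in observed_targets) / len(observed_targets)
    return max(candidate_ys, key=average_reward)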

5) Squared Loss

Assumptions:

  1. The reward function is known to be:
reward(target, y(x)) = - constant * (target - y(x))**2

Implications:

  • We can find the argmax_y explicitly. The derivative of the expected reward with respect to y is a linear function of y: d/dy E[-constant * (Target - y)**2 | x] = 2 * constant * (E[Target|x] - y). Setting it to zero gives the following solution:
best_y(x) = E[Target|x]
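A small numeric check of this solution in Python, using an arbitrary example distribution for the targets:

import numpy as np

rng = np.random.default_rng(0)
targets = rng.normal(loc=2.0, scale=1.0, size=10_000)  # arbitrary example

def average_reward(y, constant=1.0):
    # Empirical version of E[-constant * (Target - y)**2].
    return np.mean(-constant * (targets - y) ** 2)

candidates = np.linspace(0.0, 4.0, 401)
best = max(candidates, key=average_reward)
print(best, targets.mean())  # both are close to 2.0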

Other simplifications could go in different directions. For example, planning assumes a known deterministic environment.
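As a rough sketch of that direction, assuming a known deterministic step(state, action) model that returns the next state and the reward, and a short planning horizon:

from itertools import product

def plan(state, actions, step, horizon):
    # Enumerate all action sequences up to the horizon and keep the
    # sequence with the highest total reward under the known model.
    def total_reward(sequence):
        s, total = state, 0.0
        for action in sequence:
            s, reward = step(s, action)
            total += reward
        return total
    return max(product(actions, repeat=horizon), key=total_reward)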

Used Offline Resources

The supervised learning problem is described in chapter 1 of Bishop's Pattern Recognition and Machine Learning.
