This note shows how different AI problems are related. I present them from the most general to the most specific. Each more specific problem also inherits the assumptions of the problems before it.
The problems:
1) General Intelligence
The goal is to choose an action to maximize the future total reward:
best_y(observations) = argmax_action future_total_reward( observations, action)
This problem is considered by AIXI.
2) Reinforcement Learning
Assumptions:
- The environment is stationary. I.e., for a given policy, P(Trajectory) is a fixed probability distribution.
Implications:
- We can talk about the expected value of a function with respect to the probability distribution.
- A fixed policy is enough for the fixed environment.
best_policy = argmax_policy E[total_reward(Trajectory) | policy]
best_y(observations) = a draw from best_policy(Y | observations)
Note that we do not need to compute the expected total reward itself. To improve the policy, an estimate of its gradient with respect to the policy parameters is enough.
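As a minimal sketch of why the gradient is enough, the following uses the score-function (REINFORCE) estimator. Everything concrete here is an assumption made for illustration: the environment is a hypothetical one-step problem with two actions (a degenerate trajectory), and the names sample_total_reward and train_policy are invented. The policy never computes E[total_reward | policy]. It only multiplies each sampled reward by the gradient of the log-probability of the action it took.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sample_total_reward(action, rng):
    # Hypothetical environment: action 1 pays about 1, action 0 pays about 0.
    return (1.0 if action == 1 else 0.0) + rng.gauss(0.0, 0.1)


def train_policy(steps=2000, learning_rate=0.1, seed=0):
    """Return P(action = 1) after gradient ascent on sampled rewards."""
    rng = random.Random(seed)
    theta = 0.0  # logit of P(action = 1); starts at probability 0.5
    for _ in range(steps):
        p = sigmoid(theta)
        action = 1 if rng.random() < p else 0
        reward = sample_total_reward(action, rng)
        # d/dtheta log policy(action) for a Bernoulli policy with logit theta
        grad_log_prob = (1.0 - p) if action == 1 else -p
        # One-sample estimate of the gradient of E[total_reward | policy]
        theta += learning_rate * reward * grad_log_prob
    return sigmoid(theta)
```

Under these assumptions, train_policy() should return a probability close to 1 for the better action, even though the expected total reward was never evaluated.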
3) Contextual Bandits
Assumptions:
- An action does not affect the future observations. The sequence of contexts is fixed in advance.
- The reward depends only on the used action and the context.
Implications:
- There is no delayed future reward. An action causes only the immediate reward, so there is no ambiguity about which action caused it. That simplifies training.
- A stochastic policy is not needed for the final decision. Since we cannot affect the observations, randomness cannot help us escape from being stuck. The chosen decision can be deterministic.
best_y = argmax_y E[total_reward(Contexts, y)]
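A minimal sketch of such a deterministic decision rule, under assumptions that are not in the text above: the contexts and actions are discrete, and we already hold logged (context, action, reward) triples in which the actions of interest were tried in each context. The function name fit_greedy_policy is hypothetical. Because each reward depends only on its own context and action, the expected reward can be estimated separately for every (context, action) pair.

```python
from collections import defaultdict


def fit_greedy_policy(logged):
    """Map each context to the action with the highest mean logged reward."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for context, action, reward in logged:
        totals[(context, action)] += reward
        counts[(context, action)] += 1

    best = {}  # context -> (action, mean reward)
    for (context, action), total in totals.items():
        mean = total / counts[(context, action)]
        if context not in best or mean > best[context][1]:
            best[context] = (action, mean)
    return {context: action for context, (action, _) in best.items()}


# Made-up log, for illustration only.
logged = [
    ("sunny", "walk", 1.0),
    ("sunny", "bus", 0.2),
    ("rainy", "walk", 0.1),
    ("rainy", "bus", 0.9),
    ("sunny", "walk", 0.8),
]
policy = fit_greedy_policy(logged)
```

The returned policy is a plain lookup table, so the same context always gets the same action.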
4) Supervised Regression
The agent is presented with (x, target) examples.
Assumptions:
- The seen examples are independent. Their probability does not depend on the already seen examples.
- The examples are identically distributed. They share a P(Target, X) probability distribution.
- The reward function is known. It is possible to compute all possible rewards after seeing a target.
Implications:
- The maximum of the expected total reward is at the same point as the maximum of the expected reward from a single example.
- No exploration is needed. We cannot affect the future observations, and after seeing a target we can compare the rewards of all possible predictions.
best_y = argmax_y E[reward(Target, y(X))]
The distribution P(Target, X) is still unknown, so the expectation has to be estimated from the seen examples.
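A minimal sketch of the "no exploration" point, assuming a small hypothetical set of candidate predictors and an absolute-error reward chosen only for illustration. Because the reward function is known, every candidate can be scored on the same examples without ever acting in an environment. The average reward over the examples stands in for the unknown expectation.

```python
def average_reward(predictor, examples, reward):
    """Estimate E[reward(Target, y(X))] by averaging over (x, target) examples."""
    return sum(reward(target, predictor(x)) for x, target in examples) / len(examples)


def best_predictor(candidates, examples, reward):
    """Return the name of the candidate with the highest average reward."""
    return max(
        candidates,
        key=lambda name: average_reward(candidates[name], examples, reward),
    )


# Made-up data and candidates, for illustration only.
examples = [(1, 2), (2, 4), (3, 6)]
candidates = {"identity": lambda x: x, "double": lambda x: 2 * x}
negative_abs_error = lambda target, y: -abs(target - y)

chosen = best_predictor(candidates, examples, negative_abs_error)
```

Here every candidate is evaluated on every example, which is exactly what a bandit cannot do: a bandit sees the reward only for the action it actually took.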
5) Squared Loss
Assumptions:
- The reward function is known to be:
reward(target, y(x)) = - constant * (target - y(x))**2
Implications:
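- We can find the argmax_y explicitly. For a positive constant, the derivative of the expected reward with respect to y(x) is linear in y(x): it equals 2 * constant * (E[Target|x] - y(x)). Setting it to zero leads to the following solution: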
best_y(x) = E[Target|x]
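A small numeric check of this result, assuming a made-up sample of targets that share the same x and a grid of candidate predictions (both hypothetical). The sample mean plays the role of E[Target|x], and the function name best_constant_prediction is invented for this sketch.

```python
def expected_reward(y, targets, constant=1.0):
    """Average of -constant * (target - y)**2 over the sample."""
    return sum(-constant * (t - y) ** 2 for t in targets) / len(targets)


def best_constant_prediction(targets, candidates, constant=1.0):
    """Return the candidate y with the highest average reward."""
    return max(candidates, key=lambda y: expected_reward(y, targets, constant))


# Made-up targets observed for one fixed x.
targets = [1.0, 2.0, 6.0]
sample_mean = sum(targets) / len(targets)

# Candidate predictions 0.0, 0.1, ..., 7.0.
candidates = [i / 10 for i in range(71)]
best = best_constant_prediction(targets, candidates)
```

With these numbers the grid search lands on the sample mean, as the formula predicts. Changing the positive constant rescales the rewards but does not move the maximum.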
Other simplifications could go in different directions. For example, planning assumes a known deterministic environment.
Offline Resources Used
The supervised learning problem is described in chapter 1 of Bishop's Pattern Recognition and Machine Learning.