I will outline how different AI problems are related, ordered from the most general to the most specific. Each more specific problem also inherits the assumptions from the previous simplifications.

The problems:

#### 1) General Intelligence

The goal is to choose an action to maximize the future total reward:

best_y(observations) = argmax_action future_total_reward(observations, action)

This problem is considered by AIXI.
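The argmax above can be sketched directly, assuming a finite action set and a hypothetical `future_total_reward` estimator (in the general setting considered by AIXI, computing this quantity exactly is intractable):

```python
def best_y(observations, actions, future_total_reward):
    """Pick the action with the highest estimated future total reward.

    `future_total_reward` is a hypothetical estimator supplied by the
    caller; this sketch only performs the argmax over a finite action set.
    """
    return max(actions, key=lambda a: future_total_reward(observations, a))

# Toy usage: the estimator prefers the action closest to the last observation.
est = lambda obs, a: -abs(obs[-1] - a)
print(best_y([0.2, 0.7], actions=[0, 0.5, 1.0], future_total_reward=est))  # -> 0.5
```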

#### 2) Reinforcement Learning

Assumptions:

- The environment is stationary. I.e., P(Trajectory) is a fixed probability distribution.

Implications:

- We can talk about the expected value of a function with respect to the probability distribution.
- A fixed policy is enough for the fixed environment.

best_policy = argmax_policy E[total_reward(Trajectory) | policy]
best_y(observations) = a draw from best_policy(Y | observations)

Note that we do not need to compute the expected total reward itself; its gradient is enough.
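For example, the score-function (REINFORCE) estimator approximates that gradient from sampled actions and rewards alone. A minimal sketch with a one-parameter Bernoulli policy; the environment and all names here are illustrative:

```python
import math
import random

def reinforce_step(theta, sample_total_reward, lr=0.1):
    """One stochastic gradient ascent step on E[total_reward | policy].

    Policy: choose action 1 with probability sigmoid(theta).
    Score-function estimator: grad E[R] = E[R * d/dtheta log pi(action)].
    """
    p = 1.0 / (1.0 + math.exp(-theta))
    action = 1 if random.random() < p else 0
    reward = sample_total_reward(action)  # environment returns total reward
    # d/dtheta log pi(action) for a Bernoulli(sigmoid(theta)) policy:
    grad_log_pi = action - p
    return theta + lr * reward * grad_log_pi

# Toy environment: action 1 always yields reward 1, action 0 yields 0.
random.seed(0)
theta = 0.0
for _ in range(2000):
    theta = reinforce_step(theta, lambda a: float(a))
# The learned probability of action 1 should now be close to 1.
```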

#### 3) Contextual Bandits

Assumptions:

- An action does not affect the future observations; the sequence of contexts is fixed in advance.
- The reward depends only on the chosen action and the current context.

Implications:

- There is no delayed future reward: only the immediate reward is caused by the action, so there is no ambiguity about which action caused which reward. That simplifies training.
- A stochastic policy is not needed: since we cannot affect the observations, randomization cannot help us escape anything, so the chosen decision can be deterministic.

best_y = argmax_y E[total_reward(Contexts, y)]
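Because the action cannot change which contexts arrive later, a per-context greedy choice over estimated rewards suffices (some exploration is still needed while the estimates are being learned). A sketch, assuming tabular reward estimates:

```python
def best_y(context, actions, estimated_reward):
    """Greedy action for one context; valid in the contextual bandit
    setting because the action cannot change future contexts."""
    return max(actions, key=lambda a: estimated_reward[(context, a)])

# Hypothetical learned estimates of E[reward | context, action]:
estimates = {("sunny", "walk"): 1.0, ("sunny", "bus"): 0.2,
             ("rainy", "walk"): -0.5, ("rainy", "bus"): 0.8}
print(best_y("rainy", ["walk", "bus"], estimates))  # -> bus
```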

#### 4) Supervised Regression

The agent is presented with (x, target) examples.

Assumptions:

- The examples are independent: the probability of an example does not depend on the previously seen examples.
- The examples are identically distributed. They share a P(Target, X) probability distribution.
- The reward function is known: once a target is seen, the reward of every possible prediction can be computed.

Implications:

- The maximum of the expected total reward is at the same point as the maximum of the expected reward from a single example.
- No exploration is needed: we cannot affect the future observations, and we can compare the rewards of all possible predictions.

best_y = argmax_y E[reward(Target, y(X))]

The distribution P(X,Target) is still unknown.
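Since P(X, Target) is unknown, the expectation is approximated by an average over the seen i.i.d. (x, target) examples, and the known reward function scores each candidate prediction. A sketch with a finite candidate set (the data and reward here are illustrative):

```python
def best_constant_prediction(examples, candidates, reward):
    """Pick the candidate y maximizing the empirical expectation of the
    reward over the seen i.i.d. (x, target) examples."""
    def avg_reward(y):
        return sum(reward(t, y) for _, t in examples) / len(examples)
    return max(candidates, key=avg_reward)

examples = [(0, 1.0), (1, 3.0), (2, 2.0)]
reward = lambda target, y: -abs(target - y)  # known reward function
print(best_constant_prediction(examples, [1.0, 2.0, 3.0], reward))  # -> 2.0
```

With absolute-error reward, the best constant prediction is the median of the targets, which is what the sample average selects here.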

#### 5) Squared Loss

Assumptions:

- The reward function is known to be:

reward(target, y(x)) = - constant * (target - y(x))**2

Implications:

- We can find the argmax_y explicitly: the derivative of the expected reward with respect to y is a linear function of y, and setting it to zero gives the following solution:

best_y(x) = E[Target|x]
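This can be checked on samples: for a fixed x, the y minimizing the average squared loss over the observed targets is their sample mean, i.e. an estimate of E[Target | x]. A minimal sketch:

```python
def best_y_for_x(targets):
    """Minimizer of sum((t - y)**2) over y: setting the derivative
    -2 * sum(t - y) to zero gives y = mean(targets)."""
    return sum(targets) / len(targets)

# Targets observed at some fixed x; their mean estimates E[Target | x].
print(best_y_for_x([1.0, 2.0, 3.0]))  # -> 2.0
```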

Other simplifications could go in different directions. For example, planning assumes a known deterministic environment.

### Used Offline Resources

The supervised learning problem is described in chapter 1 of Bishop's Pattern Recognition and Machine Learning.