408. Temporal Difference Learning
=================================

Difficulty: Medium

In this problem, you will implement a Temporal Difference (TD) learning algorithm to estimate the value function of a given policy in a Markov Decision Process (MDP).

Example:

Suppose we have a simple MDP with two states and two actions. The reward function is defined as follows:

| State | Action | Next State | Reward |
| --- | --- | --- | --- |
| 0 | 0 | 0 | 1 |
| 0 | 0 | 1 | 0 |
| 0 | 1 | 0 | 0 |
| 0 | 1 | 1 | 1 |
| 1 | 0 | 0 | 1 |
| 1 | 0 | 1 | 0 |
| 1 | 1 | 0 | 0 |
| 1 | 1 | 1 | 1 |

We want to estimate the value function of a policy that always chooses action 0.
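A minimal TD(0) sketch of this setup is shown below. Note that the problem statement only specifies the reward table, not the transition probabilities, so the `transitions` matrix here is a hypothetical uniform choice added purely for illustration; the step size `alpha`, discount `gamma`, and step count are likewise illustrative defaults, not values from the problem.

```python
import numpy as np

# Reward table from the problem: rewards[state, action, next_state].
rewards = np.zeros((2, 2, 2))
rewards[0, 0, 0] = 1.0
rewards[0, 1, 1] = 1.0
rewards[1, 0, 0] = 1.0
rewards[1, 1, 1] = 1.0

def td0_estimate(transitions, alpha=0.1, gamma=0.9, n_steps=1000, seed=0):
    """TD(0) value estimate for the fixed policy pi(s) = 0.

    transitions[s] is the next-state distribution under action 0.
    The problem does not give transition probabilities, so the caller
    must supply an assumed dynamics matrix.
    """
    rng = np.random.default_rng(seed)
    V = np.zeros(2)   # one value estimate per state
    s = 0
    for _ in range(n_steps):
        a = 0                                       # policy always picks action 0
        s_next = rng.choice(2, p=transitions[s])    # sample next state
        r = rewards[s, a, s_next]
        # TD(0) update: move V[s] toward the one-step bootstrapped target.
        V[s] += alpha * (r + gamma * V[s_next] - V[s])
        s = s_next
    return V

# Hypothetical uniform transitions (an assumption, not from the problem).
P = np.array([[0.5, 0.5],
              [0.5, 0.5]])
V = td0_estimate(P)
print(V)
```

Under these uniform dynamics, action 0 earns an expected reward of 0.5 per step from either state, so both value estimates drift toward roughly 0.5 / (1 − γ).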

Constraints:

  • The MDP has two states and two actions.
  • The reward function is given by the table above.
  • The policy always chooses action 0.

Hint: Use NumPy and update the value function with the TD error:
V(s) ← V(s) + α [r + γ V(s′) − V(s)].

Test Cases

Test Case 1
Input: [[0, 1, 1, 0, 1], [1, 0, 0, 1, 0], [0, 0, 0, 0, 0], [1, 1, 1, 1, 1], [0, 0, 0, 0, 0]]
Expected: [[1.0, 0.0], [0.0, 1.0]]

Test Case 2
Input: [[0, 1, 1, 0, 1], [1, 0, 0, 1, 0], [0, 0, 0, 0, 0], [1, 1, 1, 1, 1], [0, 0, 0, 0, 0]]
Expected: [[1.0, 0.0], [0.0, 1.0]]

+ 3 hidden test cases