## Expected SARSA
In this problem, you will implement the Expected SARSA algorithm, a temporal-difference (TD) control method for reinforcement learning. Expected SARSA learns the action-value function Q(s, a), which estimates the expected return when taking action a in state s. Unlike SARSA, which bootstraps from the sampled next action, Expected SARSA bootstraps from the expectation of Q over the policy's action distribution in the next state.
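The standard Expected SARSA update (as given in Sutton and Barto) is:

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \sum_{a} \pi(a \mid S_{t+1})\, Q(S_{t+1}, a) - Q(S_t, A_t) \right]$$

where α is the learning rate, γ the discount factor, and π the policy whose action probabilities weight the next-state values.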
Example:
Suppose we have a grid world with four states and two actions (up and down). The reward function is as follows:
| State | Up | Down |
| --- | --- | --- |
| 1 | 0 | 1 |
| 2 | 1 | 0 |
| 3 | 0 | 1 |
| 4 | 1 | 0 |
Goal:
Implement the Expected SARSA algorithm to learn the action-value function Q(s, a).
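A minimal sketch of the two pieces the problem asks for, with assumptions: the visible test cases suggest Q is initialized uniformly to 1/num_actions, and the update step here assumes an ε-greedy policy; the function names `init_q` and `expected_sarsa_update` are illustrative, not prescribed by the problem.

```python
import numpy as np

def init_q(num_states, num_actions):
    # Initialize Q uniformly to 1/num_actions, matching the visible test cases.
    return np.full((num_states, num_actions), 1.0 / num_actions)

def expected_sarsa_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99, epsilon=0.1):
    # One Expected SARSA step: the TD target uses the expectation of
    # Q(s', .) under an epsilon-greedy policy (an assumed policy choice),
    # rather than the value of a single sampled next action.
    num_actions = Q.shape[1]
    probs = np.full(num_actions, epsilon / num_actions)
    probs[np.argmax(Q[s_next])] += 1.0 - epsilon  # greedy action gets the rest
    expected_q = np.dot(probs, Q[s_next])
    Q[s, a] += alpha * (r + gamma * expected_q - Q[s, a])
    return Q
```

Looping `expected_sarsa_update` over transitions sampled from the environment, starting from `init_q`, yields the learned action-value function.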
Test Cases
Test Case 1
Input:
```python
{"num_states": 4, "num_actions": 2}
```
Expected:
```python
array([[0.5, 0.5],
       [0.5, 0.5],
       [0.5, 0.5],
       [0.5, 0.5]])
```
Test Case 2
Input:
```python
{"num_states": 5, "num_actions": 3}
```
Expected:
```python
array([[0.33333333, 0.33333333, 0.33333333],
       [0.33333333, 0.33333333, 0.33333333],
       [0.33333333, 0.33333333, 0.33333333],
       [0.33333333, 0.33333333, 0.33333333],
       [0.33333333, 0.33333333, 0.33333333]])
```
+ 3 hidden test cases