## 410. SARSA Algorithm (Medium)

In this problem, you will implement SARSA, an on-policy reinforcement learning algorithm. SARSA learns the action-value function Q(s, a), which estimates the expected return of taking action a in state s and then following the current policy.
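The core of the algorithm is a single temporal-difference update applied after each transition. A minimal sketch (the specific transition values below are illustrative, not from the problem):

```python
import numpy as np

alpha, gamma = 0.1, 0.9        # learning rate and discount factor (see constraints)
Q = np.zeros((4, 2))           # Q-table: 4 states x 2 actions
s, a, r, s_next, a_next = 0, 1, 1.0, 1, 0   # one sampled transition (illustrative)

# SARSA update: the target bootstraps from the action a_next that the
# policy actually takes in s_next, which is what makes it on-policy.
Q[s, a] += alpha * (r + gamma * Q[s_next, a_next] - Q[s, a])
```

After this update, Q[0, 1] moves from 0 toward the target by a step of size alpha.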

Example:

Suppose we have a grid world with four states and two actions (up and down). The reward function is as follows:

| State | Up | Down |
| --- | --- | --- |
| 1 | 0 | 1 |
| 2 | 1 | 0 |
| 3 | 0 | 1 |
| 4 | 1 | 0 |

Constraints:

  • The learning rate is 0.1.
  • The discount factor is 0.9.
  • The exploration rate is 0.1.
  • The number of episodes is 100.

Goal:

Implement the SARSA algorithm to learn the action-value function Q(s, a).

Test Cases

Test Case 1
Input: {"num_states": 4, "num_actions": 2}
Expected: array([[0.5, 0.5], [0.5, 0.5], [0.5, 0.5], [0.5, 0.5]])

Test Case 2
Input: {"num_states": 5, "num_actions": 3}
Expected: array([[0.33333333, 0.33333333, 0.33333333], [0.33333333, 0.33333333, 0.33333333], [0.33333333, 0.33333333, 0.33333333], [0.33333333, 0.33333333, 0.33333333], [0.33333333, 0.33333333, 0.33333333]])

+ 3 hidden test cases
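A possible solution sketch is below. The expected outputs suggest Q is initialized uniformly to 1/num_actions, so the sketch does the same. The problem does not specify the environment interface, so `env_step`, the fixed start state 0, the step cap, and the toy dynamics for the grid world (both actions advance toward state 4, treated as terminal) are all assumptions:

```python
import numpy as np

def sarsa(num_states, num_actions, env_step, alpha=0.1, gamma=0.9,
          epsilon=0.1, num_episodes=100, max_steps=100, seed=0):
    """On-policy TD control (SARSA).

    env_step is a caller-supplied function with the assumed interface:
        env_step(state, action) -> (next_state, reward, done)
    Q is initialized uniformly to 1/num_actions, matching the
    expected outputs in the test cases above.
    """
    rng = np.random.default_rng(seed)
    Q = np.full((num_states, num_actions), 1.0 / num_actions)

    def epsilon_greedy(s):
        # Explore with probability epsilon, otherwise act greedily on Q.
        if rng.random() < epsilon:
            return int(rng.integers(num_actions))
        return int(np.argmax(Q[s]))

    for _ in range(num_episodes):
        s = 0                      # assumed fixed start state
        a = epsilon_greedy(s)
        for _ in range(max_steps):
            s2, r, done = env_step(s, a)
            a2 = epsilon_greedy(s2)
            # SARSA target uses the action a2 the policy will actually take.
            target = r if done else r + gamma * Q[s2, a2]
            Q[s, a] += alpha * (target - Q[s, a])
            s, a = s2, a2
            if done:
                break
    return Q

# Toy dynamics for the 4-state grid world above (an assumption: both
# actions move to the next state, and state 4 is terminal).
rewards = np.array([[0, 1], [1, 0], [0, 1], [1, 0]])

def env_step(s, a):
    s2 = min(s + 1, 3)
    return s2, float(rewards[s, a]), s2 == 3

Q = sarsa(4, 2, env_step)
```

Since every episode terminates on reaching state 4 before that state is ever updated, its row keeps the uniform initial value.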