Policy Iteration
===============
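Policy iteration is a method for finding an optimal policy in a Markov decision process (MDP). It alternates between two steps: policy evaluation, which computes the value function of the current policy, and policy improvement, which makes the policy greedy with respect to that value function. The loop stops when an improvement step leaves the policy unchanged.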
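
The two steps can be written out as equations. This is a sketch in standard textbook notation that the problem statement itself does not define: the discount factor $\gamma$, the transition probabilities $P(s' \mid s, a)$, the reward $R(s, a, s')$, and the value function $V^{\pi}$ are all assumed symbols.

$$
\begin{aligned}
\text{Evaluation:}\quad V^{\pi}(s) &= \sum_{s'} P\big(s' \mid s, \pi(s)\big)\,\Big[ R\big(s, \pi(s), s'\big) + \gamma\, V^{\pi}(s') \Big] \\
\text{Improvement:}\quad \pi'(s) &= \arg\max_{a} \sum_{s'} P(s' \mid s, a)\,\Big[ R(s, a, s') + \gamma\, V^{\pi}(s') \Big]
\end{aligned}
$$

If $\pi' = \pi$ for every state, the policy is greedy with respect to its own value function, and the iteration can stop.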
**Example:** We have a simple grid world with four states (A, B, C, D) and two actions (left, right). The reward function is as follows:
| State | Reward |
| --- | --- |
| A | 0 |
| B | 0 |
| C | 10 |
| D | 0 |
The goal is to find the optimal policy using policy iteration.
**Constraints:** The policy is stationary, and the reward function is deterministic.
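
The example does not say how the states are connected or how future rewards are discounted, so the following is only a minimal sketch under stated assumptions: the states form a chain A-B-C-D with walls at both ends, a state's reward is collected on entering it, and the discount factor is 0.9. The names `NEXT_STATE`, `GAMMA`, `TOL`, and all function names are hypothetical and chosen for illustration.

```python
STATES = ["A", "B", "C", "D"]
ACTIONS = ["left", "right"]
REWARD = [0.0, 0.0, 10.0, 0.0]  # taken from the reward table above

# ASSUMPTION (not given in the problem): a chain A-B-C-D with walls at both ends.
# NEXT_STATE[s][a] is the state reached from s by action a (0 = left, 1 = right).
NEXT_STATE = [[0, 1], [0, 2], [1, 3], [2, 3]]
GAMMA = 0.9   # ASSUMPTION: discount factor
TOL = 1e-10   # stopping threshold for the evaluation sweeps


def q_value(state, action, values):
    """Value of taking `action` in `state`, then following `values`.

    ASSUMPTION: the reward of a state is collected on entering it.
    """
    successor = NEXT_STATE[state][action]
    return REWARD[successor] + GAMMA * values[successor]


def evaluate_policy(policy):
    """Policy evaluation: repeat in-place Bellman backups until values settle."""
    values = [0.0] * len(STATES)
    while True:
        delta = 0.0
        for s in range(len(STATES)):
            updated = q_value(s, policy[s], values)
            delta = max(delta, abs(updated - values[s]))
            values[s] = updated
        if delta < TOL:
            return values


def improve_policy(policy, values):
    """Policy improvement: switch an action only if another is clearly better.

    Keeping the current action on (near-)ties prevents flipping between
    equally good policies, so the outer loop is guaranteed to stop.
    """
    new_policy = list(policy)
    for s in range(len(STATES)):
        best = max(range(len(ACTIONS)), key=lambda a: q_value(s, a, values))
        if q_value(s, best, values) > q_value(s, policy[s], values) + 1e-6:
            new_policy[s] = best
    return new_policy


def policy_iteration(initial_policy=(0, 0, 0, 0)):
    """Alternate evaluation and improvement until the policy stops changing."""
    policy = list(initial_policy)
    while True:
        values = evaluate_policy(policy)
        improved = improve_policy(policy, values)
        if improved == policy:
            return policy, values
        policy = improved


policy, values = policy_iteration()
print([ACTIONS[a] for a in policy])
```

Under these assumptions, and starting from the all-left policy, the loop settles on right, right, right, left for A through D. In state C both actions are equally good, because B and D are each one step away from the rewarding state, so the action reported there depends on the starting policy. A different layout, reward convention, or discount factor would give a different answer.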
Test Cases
----------
**Test Case 1**

**Input:**
[[0, 1], [1, 0], [2, 3], [3, 2]]

**Expected:**
[[0, 1], [1, 0], [0, 1], [1, 0]]

Plus 4 hidden test cases.