Monte Carlo Methods
=======================
In this problem, you will implement a Monte Carlo method to estimate the value of a policy in a given environment.
Example:
Suppose we have a simple Markov Decision Process (MDP) with two states and two actions. The reward function is defined as follows:
| State | Action | Next State | Reward |
| --- | --- | --- | --- |
| 0 | 0 | 0 | 1 |
| 0 | 0 | 1 | 0 |
| 0 | 1 | 0 | 0 |
| 0 | 1 | 1 | 1 |
| 1 | 0 | 0 | 1 |
| 1 | 0 | 1 | 0 |
| 1 | 1 | 0 | 0 |
| 1 | 1 | 1 | 1 |
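One way to encode the table above, assuming the NumPy representation suggested by the hint (the array name `R` and the `R[state, action, next_state]` index order are choices made here, not part of the problem):

```python
import numpy as np

# Hypothetical encoding of the reward table: R[state, action, next_state].
R = np.zeros((2, 2, 2))
for s in (0, 1):
    for a in (0, 1):
        for s_next in (0, 1):
            # In this table, the reward is 1 exactly when the
            # next state matches the chosen action.
            R[s, a, s_next] = 1.0 if s_next == a else 0.0
```

The loop exploits the pattern visible in the table (reward 1 iff next state equals the action); spelling out all eight entries explicitly would work just as well.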
We want to estimate the value of a policy that always chooses action 0.
Hint:
Use NumPy to generate random trajectories and estimate the value function.
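The hint can be sketched as follows. This is a minimal first-visit-free Monte Carlo estimator, not a reference solution; the function name `mc_policy_value`, the `P[s, a, s']` / `R[s, a, s']` array layout, and the fixed `horizon` (the problem does not specify when episodes end) are all assumptions:

```python
import numpy as np

def mc_policy_value(P, R, policy, start_state, n_episodes=10_000,
                    horizon=20, gamma=0.9, rng=None):
    """Estimate V(start_state) by averaging discounted returns over
    sampled trajectories. P[s, a, s'] holds transition probabilities,
    R[s, a, s'] rewards, and policy[s] the action taken in state s."""
    rng = np.random.default_rng(rng)
    n_states = P.shape[0]
    total = 0.0
    for _ in range(n_episodes):
        s, ret, discount = start_state, 0.0, 1.0
        for _ in range(horizon):
            a = policy[s]
            # Sample the next state from the transition distribution.
            s_next = rng.choice(n_states, p=P[s, a])
            ret += discount * R[s, a, s_next]
            discount *= gamma
            s = s_next
        total += ret
    return total / n_episodes
```

For the policy that always chooses action 0, pass `policy=[0, 0]`. When the transitions are deterministic, every sampled return is identical and the estimate matches the true discounted value exactly.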
Test Cases:

```
[[0, 1, 1, 0, 1], [1, 0, 0, 1, 0], [0, 0, 0, 0, 0], [1, 1, 1, 1, 1], [0, 0, 0, 0, 0]]
[[1.0, 0.0], [0.0, 1.0]]
```
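If the first list is read as five sampled state trajectories (one per row) under the all-action-0 policy, and the reward rule from the table is applied (reward 1 when the next state equals the action, here 0), the undiscounted Monte Carlo estimate can be computed directly. This interpretation of the test data is an assumption:

```python
# Hypothetical reading: each row is a state trajectory under action 0,
# so each of the four transitions in a row pays 1 iff the next state is 0.
trajectories = [[0, 1, 1, 0, 1],
                [1, 0, 0, 1, 0],
                [0, 0, 0, 0, 0],
                [1, 1, 1, 1, 1],
                [0, 0, 0, 0, 0]]
returns = []
for traj in trajectories:
    # Undiscounted return: count rewarding transitions.
    returns.append(sum(1 for s_next in traj[1:] if s_next == 0))
estimate = sum(returns) / len(returns)
# Per-trajectory returns are [1, 3, 4, 0, 4], so estimate = 12 / 5 = 2.4
```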