In this problem, you will implement the REINFORCE algorithm. REINFORCE is a policy gradient method that estimates the policy gradient by Monte Carlo sampling.
Example:
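Suppose we have a policy π with parameters θ, and an action a taken in state s. The REINFORCE algorithm estimates the policy gradient as ∇J(π) = E[∇ log π(a | s) · R(s, a)], where the gradient is taken with respect to θ, the expectation is over states and actions sampled by following π, and R(s, a) is the reward (more generally, the return) obtained after taking a in s. In practice the expectation is replaced by an average over sampled actions.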
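As an illustration of the estimator, here is a minimal sketch for a single-state problem with a softmax policy over discrete actions. The names `softmax`, `reinforce_gradient`, and `reward_fn`, and the choice of a softmax policy, are assumptions made for this example and are not part of the problem statement.

```python
import math
import random


def softmax(logits):
    """Convert logits into action probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_gradient(logits, reward_fn, num_samples=1000, seed=0):
    """Monte Carlo estimate of the gradient of E[R(a)] w.r.t. the logits.

    Hypothetical helper: assumes a softmax policy over len(logits) actions
    and a reward function that depends only on the sampled action.
    """
    rng = random.Random(seed)
    probs = softmax(logits)
    n = len(logits)
    grad = [0.0] * n
    for _ in range(num_samples):
        # Sample an action from the current policy.
        a = rng.choices(range(n), weights=probs)[0]
        r = reward_fn(a)
        # For a softmax policy, d log pi(a) / d logit_k = 1[k == a] - probs[k].
        for k in range(n):
            grad[k] += r * ((1.0 if k == a else 0.0) - probs[k])
    # Average the per-sample terms to approximate the expectation.
    return [g / num_samples for g in grad]


if __name__ == "__main__":
    rewards = [1.0, 2.0]  # assumed reward for each of two actions
    print(reinforce_gradient([0.0, 0.0], lambda a: rewards[a], num_samples=5000))
```

With equal logits and rewards of 1 and 2, the exact gradient is [-0.25, 0.25], so the estimate should land close to those values: the policy is pushed toward the higher-reward action.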
Constraints:
You must use the REINFORCE algorithm to estimate the policy gradient.
Test Cases
Test Case 1
Input:
[[1, 2], [3, 4]]
Expected:
[1.5, 3.5]
Test Case 2
Input:
[[5, 6], [7, 8]]
Expected:
[5.5, 7.5]
+ 3 hidden test cases
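Both visible test cases are consistent with treating each inner list as a set of sampled rewards and returning its mean, that is, a Monte Carlo estimate of the expected reward. This reading is inferred only from the two examples above, so the hidden test cases may check something different. A minimal sketch under that assumption, with the hypothetical function name `estimate_expected_rewards`:

```python
def estimate_expected_rewards(reward_samples):
    """Return the mean of each row of sampled rewards.

    Assumption: each inner list holds rewards sampled for one state or
    episode, and the expected output is the per-row Monte Carlo average.
    """
    return [sum(row) / len(row) for row in reward_samples]


if __name__ == "__main__":
    print(estimate_expected_rewards([[1, 2], [3, 4]]))  # [1.5, 3.5]
    print(estimate_expected_rewards([[5, 6], [7, 8]]))  # [5.5, 7.5]
```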