Click4Ai

424.

Hard

In this problem, you will implement the REINFORCE algorithm. The REINFORCE algorithm is a type of policy gradient method that uses the Monte Carlo method to estimate the policy gradient.

Example:

Suppose we have a policy π and an action a. The REINFORCE algorithm estimates the policy gradient as ∇J(π) = E[R(s, a)], where R(s, a) is the reward function.

Constraints:

You must use the REINFORCE algorithm to estimate the policy gradient.

Test Cases

Test Case 1
Input: [[1, 2], [3, 4]]
Expected: [1.5, 3.5]
Test Case 2
Input: [[5, 6], [7, 8]]
Expected: [5.5, 7.5]
+ 3 hidden test cases