Click4Ai

423.

Hard

In this problem, you will prove the Policy Gradient Theorem. The Policy Gradient Theorem expresses the gradient of the expected return with respect to the policy parameters as an expectation that can be estimated from samples drawn from the policy itself.

Example:

Suppose we have a policy π_θ with parameters θ that selects an action a in state s. The expected return is defined as J(π_θ) = E[R(s, a)], where R(s, a) is the reward function and the expectation is taken over a ~ π_θ(· | s). The policy gradient is ∇_θ J(π_θ) = ∇_θ E[R(s, a)]; note that the gradient cannot simply be moved inside the expectation, because the distribution being averaged over depends on θ.
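The identity can be checked numerically. The sketch below (with an illustrative softmax policy over three actions in a single state; the reward vector and parameter values are arbitrary choices, not part of the problem) compares the analytic gradient of J(θ) with a Monte Carlo score-function estimate E[R(s, a) ∇_θ log π_θ(a | s)]:

```python
import numpy as np

rng = np.random.default_rng(0)
R = np.array([1.0, 2.0, 3.0])        # reward per action, R(s, a) for a fixed s
theta = np.array([0.1, -0.2, 0.3])   # softmax logits (policy parameters)

def pi(theta):
    """Softmax policy over actions."""
    z = np.exp(theta - theta.max())
    return z / z.sum()

# Analytic gradient of J(theta) = sum_a pi_theta(a) R(a).
# For softmax logits this works out to pi_k * (R_k - J).
p = pi(theta)
J = p @ R
grad_analytic = p * (R - J)

# Score-function estimate: average R(a) * grad_theta log pi_theta(a)
# over sampled actions; for softmax logits, grad log pi(a) = e_a - pi.
n = 200_000
actions = rng.choice(3, size=n, p=p)
scores = np.eye(3)[actions] - p
grad_mc = (R[actions][:, None] * scores).mean(axis=0)

print(np.round(grad_analytic, 3))
print(np.round(grad_mc, 3))
```

With 200,000 samples the two gradients agree to roughly two decimal places, which is the theorem at work: the sampled estimate needed no derivative of R at all, only the derivative of log π_θ.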

Constraints:

You must prove the Policy Gradient Theorem, which in this single-step setting states that ∇_θ J(π_θ) = E[R(s, a) ∇_θ log π_θ(a | s)], where the expectation is over a ~ π_θ(· | s).
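For a discrete action space, a standard proof sketch uses the log-derivative trick ∇_θ π_θ = π_θ ∇_θ log π_θ:

```latex
\begin{align*}
\nabla_\theta J(\pi_\theta)
  &= \nabla_\theta \sum_a \pi_\theta(a \mid s)\, R(s, a) \\
  &= \sum_a R(s, a)\, \nabla_\theta \pi_\theta(a \mid s) \\
  &= \sum_a R(s, a)\, \pi_\theta(a \mid s)\, \nabla_\theta \log \pi_\theta(a \mid s) \\
  &= \mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\!\left[ R(s, a)\, \nabla_\theta \log \pi_\theta(a \mid s) \right]
\end{align*}
```

The second line exchanges the gradient and the (finite) sum, which is valid because R(s, a) does not depend on θ; only the policy does.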

Test Cases

Test Case 1
Input: [[1, 2], [3, 4]]
Expected: [1.5, 3.5]
Test Case 2
Input: [[5, 6], [7, 8]]
Expected: [5.5, 7.5]
+ 3 hidden test cases
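Assuming the test cases give a reward matrix R with one row per state and one column per action, and expect the per-state expected reward E[R(s, a)] under a uniform policy (i.e., the mean of each row — an interpretation consistent with both visible cases, but not stated by the problem), a minimal solution sketch:

```python
def expected_returns(rewards):
    """For each state (row), the expected reward under a uniform
    policy: the mean over actions (columns)."""
    return [sum(row) / len(row) for row in rewards]

print(expected_returns([[1, 2], [3, 4]]))  # → [1.5, 3.5]
print(expected_returns([[5, 6], [7, 8]]))  # → [5.5, 7.5]
```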