In this problem, you will implement the Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm. TD3 is a model-free, off-policy actor-critic reinforcement learning algorithm for continuous action spaces. It uses two networks for the policy (the actor and its slowly updated target copy) and two critic networks that each estimate the action value, with a target copy of each critic. The goal is to learn a policy that maximizes the expected cumulative reward.

**Example:** In a simple navigation task with continuous actions, the learned policy maps each state to one specific action, such as a 2-D velocity. It does not assign a probability of 0.25 to each of up, down, left, and right, as a uniform random policy in a grid world would.

**Constraints:** The algorithm should use two networks for the policy (actor and target actor) and two critic networks. The policy should be deterministic.
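As a rough illustration of how the two critics are typically used in TD3, the sketch below computes the learning target with target policy smoothing and the clipped double-Q minimum, and shows a soft (Polyak) update for target parameters. It is a minimal NumPy sketch, not a full solution: the function names, the callables `actor_target`, `critic1_target`, and `critic2_target`, and the default hyperparameter values are all assumptions made for this example, not part of the problem statement.

```python
import numpy as np

def td3_target(reward, next_state, done,
               actor_target, critic1_target, critic2_target, rng,
               gamma=0.99, policy_noise=0.2, noise_clip=0.5, max_action=1.0):
    """Compute the TD3 critic target for a batch (hypothetical helper).

    actor_target(next_state) -> actions of shape (batch, action_dim)
    criticN_target(next_state, action) -> Q-values of shape (batch,)
    """
    # Target policy smoothing: add clipped Gaussian noise to the target action.
    next_action = actor_target(next_state)
    noise = rng.normal(0.0, policy_noise, size=next_action.shape)
    noise = np.clip(noise, -noise_clip, noise_clip)
    next_action = np.clip(next_action + noise, -max_action, max_action)

    # Clipped double-Q: use the smaller of the two target critic estimates.
    q1 = critic1_target(next_state, next_action)
    q2 = critic2_target(next_state, next_action)
    q_min = np.minimum(q1, q2)

    # Bootstrap only for non-terminal transitions (done is 0.0 or 1.0).
    return reward + gamma * (1.0 - done) * q_min

def soft_update(target_params, online_params, tau=0.005):
    """Polyak-average online parameters into the target parameters."""
    return [(1.0 - tau) * t + tau * p
            for t, p in zip(target_params, online_params)]

# Usage with stand-in (assumed) networks that return constants.
rng = np.random.default_rng(0)
states = np.zeros((2, 3))
targets = td3_target(
    reward=np.array([1.0, 1.0]),
    next_state=states,
    done=np.array([0.0, 1.0]),
    actor_target=lambda s: np.zeros((s.shape[0], 1)),
    critic1_target=lambda s, a: np.full(s.shape[0], 2.0),
    critic2_target=lambda s, a: np.full(s.shape[0], 1.0),
    rng=rng,
)
```

In a full training loop, both critics would regress toward this target on every step, while the actor and all target networks would be updated less often (every second critic update is a common choice). That lower update frequency is the "delayed" part of the algorithm's name.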
Test Cases
Test Case 1
Input:
[[1, 2], 3]
Expected:
[0.5, 0.5]
Test Case 2
Input:
[[4, 5], 6]
Expected:
[0.8, 0.8]
+ 3 hidden test cases