Implement the Soft Actor-Critic (SAC) algorithm, a model-free reinforcement learning algorithm that combines policy-gradient and value-function methods and adds an entropy term to the objective to encourage exploration.

**Example:** Consider a simple grid world where an agent must navigate from a start state to a goal state. The agent receives a reward of -1 for each step taken, so maximizing the expected cumulative reward means reaching the goal in as few steps as possible.

**Constraints:** Use NumPy for numerical computations and ensure the algorithm converges to an optimal policy.
Test Cases
Test Case 1
Input:
{"env": "GridWorld", "max_episodes": 1000, "max_steps": 100, "gamma": 0.99, "alpha": 0.1}Expected:
policyTest Case 2
Input:
{"env": "CartPole", "max_episodes": 1000, "max_steps": 100, "gamma": 0.99, "alpha": 0.1}Expected:
policy+ 3 hidden test cases
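SAC proper trains neural-network actor and critics; for the tabular GridWorld case above, the maximum-entropy idea it builds on can be sketched in pure NumPy as tabular soft Q-learning (the tabular special case of SAC's entropy-regularized objective, named here plainly because it is a simplification, not SAC itself). The 4x4 layout, the goal in the bottom-right corner, the fixed start state, and the `temperature` parameter are all assumptions for illustration; `gamma` and `alpha` match the test-case input.

```python
import numpy as np

SIZE = 4                              # assumed 4x4 grid (not given by the problem)
N_STATES, N_ACTIONS = SIZE * SIZE, 4
GOAL = N_STATES - 1                   # assumed goal: bottom-right state

def step(state, action):
    """Deterministic grid dynamics: 0=up, 1=down, 2=left, 3=right."""
    r, c = divmod(state, SIZE)
    if action == 0:   r = max(r - 1, 0)
    elif action == 1: r = min(r + 1, SIZE - 1)
    elif action == 2: c = max(c - 1, 0)
    else:             c = min(c + 1, SIZE - 1)
    nxt = r * SIZE + c
    return nxt, -1.0, nxt == GOAL     # reward of -1 per step, as in the problem

def soft_value(q_row, temperature):
    """Soft state value V(s) = temperature * logsumexp(Q(s, .) / temperature)."""
    m = q_row.max()
    return m + temperature * np.log(np.exp((q_row - m) / temperature).sum())

def train(max_episodes=1000, max_steps=100, gamma=0.99, alpha=0.1,
          temperature=0.5, seed=0):
    """Tabular soft Q-learning; returns the greedy policy, one action per state."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((N_STATES, N_ACTIONS))
    for _ in range(max_episodes):
        s = 0                                     # assumed fixed start state
        for _ in range(max_steps):
            # Act with the softmax (maximum-entropy) policy derived from Q.
            logits = (Q[s] - Q[s].max()) / temperature
            probs = np.exp(logits)
            probs /= probs.sum()
            a = rng.choice(N_ACTIONS, p=probs)
            s2, r, done = step(s, a)
            # Soft Bellman backup: bootstrap from the soft value of s2.
            target = r + (0.0 if done else gamma * soft_value(Q[s2], temperature))
            Q[s, a] += alpha * (target - Q[s, a])
            s = s2
            if done:
                break
    return np.argmax(Q, axis=1)                   # greedy policy extracted from Q

policy = train()
```

Following `policy` greedily from the start state should reach the goal in a handful of steps once training has converged; the softmax behavior policy keeps exploration alive early on, while the entropy bonus in `soft_value` stays below the per-step cost of 1, so the shortest path remains optimal.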