In this problem, we will implement the Softmax exploration strategy for reinforcement learning. The Softmax algorithm chooses the action with the highest probability based on the estimated rewards.
**Example:** Consider a simple grid world where an agent can move up, down, left, or right. The agent receives a reward of 1 for reaching a goal state and -1 for hitting a wall.
**Constraints:** Implement the Softmax algorithm with a temperature parameter that controls the exploration-exploitation trade-off.
import numpy as np
def softmax_exploration(rewards, temperature):
# Your code here
pass
Test Cases
Test Case 1
Input:
[[1, 2], [3, 4]]Expected:
0Test Case 2
Input:
[[10, 20], [30, 40]]Expected:
0+ 3 hidden test cases