In this problem, we will implement the Upper Confidence Bound (UCB) exploration strategy for reinforcement learning. The UCB algorithm balances exploration and exploitation by choosing the action with the highest estimated reward plus a bonus term that encourages exploration.
**Example:** Consider a simple grid world where an agent can move up, down, left, or right. The agent receives a reward of 1 for reaching a goal state and -1 for hitting a wall.
**Constraints:** Implement the UCB algorithm with a bonus term that *decreases* with the number of times an action has been taken (and grows with the total number of steps), so that rarely tried actions are favored.
```python
import numpy as np

def ucb_exploration(rewards, counts, bonus):
    # Your code here
    pass
```
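Before filling in the stub, it may help to see the selection rule in isolation. The sketch below assumes the standard UCB1 bonus, \(\sqrt{c \ln t / N(a)}\), where \(t\) is the total number of pulls so far and \(N(a)\) is how often action \(a\) has been taken; the function name `ucb_select` and the constant `c` are illustrative choices, not part of the required interface.

```python
import numpy as np

def ucb_select(mean_rewards, counts, c=2.0):
    """Pick the action maximizing estimated reward plus a UCB1-style bonus.

    mean_rewards: per-action estimated mean reward
    counts: per-action pull counts
    c: exploration constant (c=2 recovers the classic UCB1 bonus)
    """
    t = counts.sum()  # total pulls so far
    # Untried actions (count 0) get an infinite bonus so they are tried first;
    # np.maximum guards the division since np.where evaluates both branches.
    bonus = np.where(
        counts > 0,
        np.sqrt(c * np.log(max(t, 1)) / np.maximum(counts, 1)),
        np.inf,
    )
    return int(np.argmax(mean_rewards + bonus))
```

Note how the bonus shrinks as `counts` grows: with equal estimated rewards `[1.0, 1.0]` but counts `[1, 100]`, the rarely tried action 0 wins the tie because its bonus is roughly ten times larger.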
**Test Cases**

Test Case 1
Input: `[[1, 2], [3, 4]]`
Expected: `0`

Test Case 2
Input: `[[10, 20], [30, 40]]`
Expected: `1`

(+ 3 hidden test cases)