Click4Ai

416.

Medium

In this problem, we will implement the Upper Confidence Bound (UCB) exploration strategy for reinforcement learning. The UCB algorithm balances exploration and exploitation by choosing the action with the highest estimated reward plus a bonus term that encourages exploration.
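A common concrete form of this rule is UCB1. The symbols below are standard textbook notation, not names taken from this problem, so treat them as assumptions:

$$
a_t = \arg\max_{a} \left[ \hat{Q}_t(a) + c \sqrt{\frac{\ln t}{N_t(a)}} \right]
$$

Here $\hat{Q}_t(a)$ is the current mean reward estimate for action $a$, $N_t(a)$ is the number of times $a$ has been taken so far, $t$ is the total number of actions taken, and $c > 0$ scales the exploration bonus. The bonus shrinks as $N_t(a)$ grows and rises slowly with $t$, so an action that has been neglected for a long time eventually gets selected again.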

**Example:** Consider a simple grid world where an agent can move up, down, left, or right. The agent receives a reward of 1 for reaching a goal state and -1 for hitting a wall.

**Constraints:** Implement the UCB algorithm with a bonus term that shrinks as an action is taken more often, so that rarely tried actions receive a larger exploration bonus.

import numpy as np

def ucb_exploration(rewards, counts, bonus):
    # Your code here
    pass
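The problem does not define what each argument of the stub holds, so the following is only a minimal sketch of a UCB1-style selection under stated assumptions: `rewards` is taken to be the per-action mean reward estimates, `counts` the number of times each action has been taken, and `bonus` the exploration coefficient. The name `ucb_exploration_sketch` is hypothetical and is used to keep it distinct from the stub.

```python
import numpy as np

def ucb_exploration_sketch(rewards, counts, bonus):
    """Return the index of the action with the highest UCB1-style score.

    Assumptions (not stated in the problem):
      rewards: per-action mean reward estimates
      counts:  number of times each action has been taken
      bonus:   exploration coefficient c
    """
    rewards = np.asarray(rewards, dtype=float)
    counts = np.asarray(counts, dtype=float)

    # An action that has never been tried is selected first,
    # which also guarantees every count is at least 1 below.
    untried = np.flatnonzero(counts == 0)
    if untried.size > 0:
        return int(untried[0])

    total = counts.sum()
    # The bonus shrinks as an action's own count grows
    # and rises slowly with the total number of steps.
    exploration = bonus * np.sqrt(np.log(total) / counts)
    return int(np.argmax(rewards + exploration))
```

With equal mean estimates, the less frequently taken action wins because its bonus is larger; with `bonus` set to 0 the rule reduces to greedy selection on the mean estimates.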

Test Cases

Test Case 1
Input: [[1, 2], [3, 4]]
Expected: 0
Test Case 2
Input: [[10, 20], [30, 40]]
Expected: 1
+ 3 hidden test cases