Dyna-Q Algorithm
**Example:** Consider a robot arm that moves on a grid and can step up, down, left, or right. It receives a reward of +1 for reaching the goal and -1 for hitting a wall.
**Constraints:** The grid is 5x5, the robot starts at the top-left corner, and the goal is at the bottom-right corner. The robot can move only in these four directions (no diagonal moves).
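The dynamics described above can be written as a small environment function. This is a minimal sketch under assumptions the problem leaves open: action indices follow the listed order (0 = up, 1 = down, 2 = left, 3 = right), the grid boundary is the only wall, a wall hit leaves the robot in place, every other non-goal move earns 0, and entering the goal ends the episode. The names `step`, `MOVES`, and `GOAL` are hypothetical, and the goal is hard-coded from the constraints rather than read from the input grid.

```python
from typing import Tuple

State = Tuple[int, int]
N = 5                    # the grid is 5x5
GOAL = (N - 1, N - 1)    # bottom-right corner; the start is (0, 0), top-left
# Assumed action indices, following the order "up, down, left, or right".
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]


def step(state: State, action: int) -> Tuple[State, float, bool]:
    """Apply one move and return (next_state, reward, done).

    Assumptions: the grid boundary is the only wall; hitting it keeps the
    robot in place with reward -1; entering the goal gives +1 and ends the
    episode; any other move gives 0.
    """
    dr, dc = MOVES[action]
    r, c = state[0] + dr, state[1] + dc
    if not (0 <= r < N and 0 <= c < N):
        return state, -1.0, False      # hit the boundary wall, stay put
    if (r, c) == GOAL:
        return (r, c), 1.0, True       # reached the goal
    return (r, c), 0.0, False


print(step((0, 0), 0))   # up from the start hits the wall -> ((0, 0), -1.0, False)
print(step((4, 3), 3))   # right into the goal -> ((4, 4), 1.0, True)
```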
Implement the Dyna-Q algorithm to find the optimal policy for this robot arm.
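One way to approach the task is the standard tabular Dyna-Q loop: after each real step, apply a one-step Q-learning update, record the observed transition in a deterministic model, then replay `planning_steps` randomly chosen stored transitions as extra Q-learning updates. The sketch below repeats the assumed grid dynamics (boundary walls, -1 for a wall hit, +1 and episode end at the goal, 0 otherwise) so it runs on its own. The hyperparameters (`alpha`, `gamma`, `epsilon`, episode and step counts) are illustrative choices, not values given by the problem, and `dyna_q` and `greedy_path` are hypothetical names.

```python
import random
from typing import Dict, List, Tuple

State = Tuple[int, int]
N = 5                                          # 5x5 grid
START, GOAL = (0, 0), (N - 1, N - 1)           # top-left start, bottom-right goal
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]     # assumed order: up, down, left, right


def step(s: State, a: int) -> Tuple[State, float, bool]:
    # Assumed dynamics: an off-grid move is a wall hit (-1, stay in place);
    # entering the goal gives +1 and ends the episode; other moves give 0.
    r, c = s[0] + MOVES[a][0], s[1] + MOVES[a][1]
    if not (0 <= r < N and 0 <= c < N):
        return s, -1.0, False
    done = (r, c) == GOAL
    return (r, c), (1.0 if done else 0.0), done


def dyna_q(episodes: int = 100, planning_steps: int = 50, alpha: float = 0.1,
           gamma: float = 0.95, epsilon: float = 0.1, max_steps: int = 200,
           seed: int = 0) -> Dict[State, List[float]]:
    rng = random.Random(seed)
    q = {(r, c): [0.0] * 4 for r in range(N) for c in range(N)}
    model: Dict[Tuple[State, int], Tuple[float, State, bool]] = {}
    seen: List[Tuple[State, int]] = []         # (state, action) pairs stored in the model

    def act(s: State) -> int:
        # Epsilon-greedy behaviour policy with random tie-breaking.
        if rng.random() < epsilon:
            return rng.randrange(4)
        best = max(q[s])
        return rng.choice([a for a in range(4) if q[s][a] == best])

    def update(s: State, a: int, reward: float, s2: State, done: bool) -> None:
        # One-step Q-learning update; terminal transitions do not bootstrap.
        target = reward if done else reward + gamma * max(q[s2])
        q[s][a] += alpha * (target - q[s][a])

    for _ in range(episodes):
        s = START
        for _ in range(max_steps):
            a = act(s)
            s2, reward, done = step(s, a)
            update(s, a, reward, s2, done)          # direct RL from real experience
            if (s, a) not in model:
                seen.append((s, a))
            model[(s, a)] = (reward, s2, done)      # deterministic model learning
            for _ in range(planning_steps):         # planning from simulated experience
                ps, pa = rng.choice(seen)
                update(ps, pa, *model[(ps, pa)])
            if done:
                break
            s = s2
    return q


def greedy_path(q: Dict[State, List[float]], max_len: int = 2 * N * N) -> List[State]:
    # Follow the greedy (first maximal) action from the start state.
    path, s = [START], START
    while s != GOAL and len(path) <= max_len:
        a = max(range(4), key=lambda i: q[s][i])
        s, _, _ = step(s, a)
        path.append(s)
    return path


q = dyna_q()
print(greedy_path(q))
```

Once the Q-table has converged, the greedy action in every cell moves toward the goal, so the path is a shortest 8-move route; with fewer episodes or planning steps it can be longer, and which of the equally short routes appears depends on the seed.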
Test Cases
Test Case 1
Input:
[[0, 0, 0, 0, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0, 1]]
Expected:
[[[0.1, 0.1, 0.1, 0.7], [0.1, 0.1, 0.1, 0.7], [0.1, 0.1, 0.1, 0.7], [0.1, 0.1, 0.1, 0.7], [0.1, 0.1, 0.1, 0.7]]]
Test Case 2
Input:
[[0, 0, 0, 0, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0, 0]]
Expected:
[[[0.1, 0.1, 0.1, 0.7], [0.1, 0.1, 0.1, 0.7], [0.1, 0.1, 0.1, 0.7], [0.1, 0.1, 0.1, 0.7], [0.1, 0.1, 0.1, 0.7]]]
+ 3 hidden test cases
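The expected rows look like per-state action probabilities. Each row [0.1, 0.1, 0.1, 0.7] is consistent with an ε-greedy policy over four actions with ε = 0.4: every action gets ε/4 = 0.1, and the greedy action gets an extra 1 - ε = 0.6. The problem does not state this, so treat ε = 0.4, the action order, and the hypothetical helper `epsilon_greedy_probs` below as assumptions. The sketch shows how one row of a Q-table could be turned into such a probability row.

```python
from typing import List


def epsilon_greedy_probs(q_values: List[float], epsilon: float) -> List[float]:
    """Return the epsilon-greedy action probabilities for one state.

    Each of the k actions gets epsilon / k; the first maximal (greedy)
    action additionally gets 1 - epsilon, so the row sums to 1.
    """
    k = len(q_values)
    greedy = max(range(k), key=lambda a: q_values[a])
    probs = [epsilon / k] * k
    probs[greedy] += 1.0 - epsilon
    return probs


# Hypothetical Q-values for one state where action 3 ("right") is best.
print(epsilon_greedy_probs([0.2, 0.5, -1.0, 0.9], epsilon=0.4))  # -> [0.1, 0.1, 0.1, 0.7]
```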