Q-Learning Algorithm
======================
In this problem, you will implement the Q-learning algorithm to learn an optimal policy in a Markov Decision Process (MDP).
Example:
Suppose we have a simple MDP with two states and two actions. The reward for each (state, action, next state) transition is defined as follows:
| State | Action | Next State | Reward |
| --- | --- | --- | --- |
| 0 | 0 | 0 | 1 |
| 0 | 0 | 1 | 0 |
| 0 | 1 | 0 | 0 |
| 0 | 1 | 1 | 1 |
| 1 | 0 | 0 | 1 |
| 1 | 0 | 1 | 0 |
| 1 | 1 | 0 | 0 |
| 1 | 1 | 1 | 1 |
We want to learn an optimal policy using Q-learning.
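One way to make the table above usable in code is to store it as a NumPy array indexed by (state, action, next state). This is only a sketch: the names `TRANSITIONS`, `build_reward_array`, and `R` are hypothetical and not part of the problem statement, and the problem does not say how the MDP is passed to your solution.

```python
import numpy as np

# Rows copied from the example table: (state, action, next_state, reward).
TRANSITIONS = [
    (0, 0, 0, 1), (0, 0, 1, 0),
    (0, 1, 0, 0), (0, 1, 1, 1),
    (1, 0, 0, 1), (1, 0, 1, 0),
    (1, 1, 0, 0), (1, 1, 1, 1),
]


def build_reward_array(transitions, n_states=2, n_actions=2):
    """Return an array where rewards[s, a, s_next] is the transition reward."""
    rewards = np.zeros((n_states, n_actions, n_states))
    for s, a, s_next, r in transitions:
        rewards[s, a, s_next] = r
    return rewards


# Hypothetical name; R[s, a, s_next] looks up a single table row.
R = build_reward_array(TRANSITIONS)
```

With this layout, `R[0, 1, 1]` reads the reward for taking action 1 in state 0 and landing in state 1.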
Constraints:
Hint:
Use NumPy to update the Q-values using the Q-learning update rule.
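The standard tabular Q-learning update is Q(s, a) ← Q(s, a) + α [r + γ max_a' Q(s', a') − Q(s, a)]. A minimal sketch of a single update step follows. The function name `q_update` and the default values of the learning rate `alpha` and discount factor `gamma` are assumptions chosen for illustration; the problem statement does not specify them.

```python
import numpy as np


def q_update(q, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    """Apply one tabular Q-learning update to q in place and return it.

    alpha (learning rate) and gamma (discount factor) are assumed defaults,
    not values given by the problem.
    """
    # Target: immediate reward plus discounted value of the best next action.
    td_target = reward + gamma * np.max(q[next_state])
    # Move the current estimate a fraction alpha toward the target.
    q[state, action] += alpha * (td_target - q[state, action])
    return q


# Q-table for two states and two actions, initialised to zero.
q = np.zeros((2, 2))
q_update(q, state=0, action=1, reward=1.0, next_state=1)
```

Repeating this update over many sampled transitions, with some exploration in how actions are chosen, is what drives the Q-table toward the optimal action values.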
Test Cases
Test Case 1
Input:
[[0, 1, 1, 0, 1], [1, 0, 0, 1, 0], [0, 0, 0, 0, 0], [1, 1, 1, 1, 1], [0, 0, 0, 0, 0]]
Expected:
[[0.0, 1.0], [1.0, 0.0]]
Test Case 2
Input:
[[0, 1, 1, 0, 1], [1, 0, 0, 1, 0], [0, 0, 0, 0, 0], [1, 1, 1, 1, 1], [0, 0, 0, 0, 0]]
Expected:
[[0.0, 1.0], [1.0, 0.0]]
+ 3 hidden test cases