Click4Ai

413.

Hard

N-step TD Learning

Example:

In a simple grid world, an agent can move up, down, left, or right, and it receives a reward of +1 for reaching the goal state. The task is to estimate the state-value function V(s) using N-step TD learning.

Constraints:

N = 3 (steps per update), gamma = 0.9 (discount factor), epsilon = 0.1 (exploration rate), learning rate alpha = 0.01
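For reference, this is the textbook form of n-step TD prediction (as in Sutton and Barto, Reinforcement Learning: An Introduction). The problem does not spell out its update rule, so treating this formulation as the intended one is an assumption. Here R_{t+k} is the reward received after step t+k-1, alpha is the learning rate, and T is the time step at which the episode ends. If t + n >= T, the return stops at R_T and the bootstrap term gamma^n V(S_{t+n}) is dropped. The second line substitutes N = 3 and gamma = 0.9.

```latex
G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^{n} V(S_{t+n})

G_{t:t+3} = R_{t+1} + 0.9\,R_{t+2} + 0.81\,R_{t+3} + 0.729\,V(S_{t+3})

V(S_t) \leftarrow V(S_t) + \alpha \left[ G_{t:t+n} - V(S_t) \right], \qquad \alpha = 0.01
```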

Implement N-step TD learning to estimate the value function.
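Below is a minimal sketch of tabular n-step TD prediction in Python, with the constraint values as defaults. Several details are not given by the problem and are assumptions here: the agent starts in the top-left cell, the goal is the bottom-right cell and ends the episode, moves into a wall leave the agent in place, and epsilon is read as the exploration rate of an epsilon-greedy policy built around a fixed "move right, then down" heuristic. The function name `n_step_td` and its `episodes` and `seed` parameters are hypothetical.

```python
import math
import random

# Actions as (row delta, col delta): up, down, left, right.
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]


def n_step_td(values, episodes=2000, n=3, gamma=0.9, alpha=0.01,
              epsilon=0.1, seed=0):
    """Estimate V(s) on a grid with n-step TD prediction.

    Hypothetical setup (not from the problem statement): start at (0, 0),
    the bottom-right cell is a terminal goal worth +1 on arrival, and the
    policy is epsilon-greedy around a fixed "move right, then down" rule.
    """
    rng = random.Random(seed)
    V = [row[:] for row in values]          # work on a copy of the input table
    rows, cols = len(V), len(V[0])
    goal = (rows - 1, cols - 1)

    def policy(state):
        # With probability epsilon explore, otherwise head toward the goal.
        if rng.random() < epsilon:
            return rng.choice(ACTIONS)
        return (0, 1) if state[1] < cols - 1 else (1, 0)

    def step(state, action):
        # Moves that would leave the grid keep the agent in place.
        r = min(max(state[0] + action[0], 0), rows - 1)
        c = min(max(state[1] + action[1], 0), cols - 1)
        return (r, c)

    for _ in range(episodes):
        states = [(0, 0)]       # states[i] = S_i
        rewards = [0.0]         # rewards[i] = R_i (index 0 is unused)
        T = math.inf            # episode length, unknown until the goal is hit
        t = 0
        while True:
            if t < T:
                nxt = step(states[t], policy(states[t]))
                states.append(nxt)
                rewards.append(1.0 if nxt == goal else 0.0)
                if nxt == goal:
                    T = t + 1
            tau = t - n + 1     # time step whose estimate is updated now
            if tau >= 0:
                # n-step return: discounted rewards R_{tau+1} .. R_{min(tau+n, T)}
                G = sum(gamma ** (i - tau - 1) * rewards[i]
                        for i in range(tau + 1, min(tau + n, T) + 1))
                if tau + n < T:
                    r, c = states[tau + n]
                    G += gamma ** n * V[r][c]   # bootstrap from V(S_{tau+n})
                r, c = states[tau]
                V[r][c] += alpha * (G - V[r][c])
            if tau == T - 1:
                break
            t += 1
    return V
```

For example, `n_step_td([[0.0, 0.0], [0.0, 0.0]])` returns a new 2x2 table. The goal cell is terminal, so it keeps its initial value, while the other visited cells move toward their discounted returns. Keeping `T` at infinity until the goal is reached lets one loop handle both full n-step returns and returns truncated at the end of an episode.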

Test Cases

Test Case 1
Input: [[0.5, 0.3], [0.2, 0.1]]
Expected: [[0.51, 0.31], [0.21, 0.11]]
Test Case 2
Input: [[0.1, 0.2], [0.3, 0.4]]
Expected: [[0.11, 0.21], [0.31, 0.41]]
+ 3 hidden test cases