Click4Ai

419. Prioritized Experience Replay
=====================================

Difficulty: Hard

In this problem, you will implement Prioritized Experience Replay (PER), a technique used in Deep Reinforcement Learning to improve the stability and efficiency of training agents.

**Example:**

Suppose we have a simple Markov Decision Process (MDP) with two states (A and B) and two actions (left and right). The agent starts at state A and receives a reward of -1 for each step. The goal is to reach state B.

| State | Action | Next State | Reward |
| --- | --- | --- | --- |
| A | left | A | -1 |
| A | right | B | 10 |
| B | left | B | -1 |
| B | right | A | -1 |

The agent starts at state A, chooses the right action, and reaches state B. The reward of 10 is much higher than the expected -1, so this transition has a large temporal-difference (TD) error. Such surprising experiences should be given high priority in the replay buffer so they are sampled more often during training.
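In standard PER, a transition's priority is its absolute TD error (plus a small constant so no transition is ever unsampleable). A minimal sketch for the (A, right) transition above; the discount factor and Q-values are illustrative assumptions, not given by the problem:

```python
def td_error_priority(reward, gamma, q_sa, max_q_next, epsilon=1e-6):
    """Priority = |r + gamma * max_a' Q(s', a') - Q(s, a)| + epsilon.

    The epsilon keeps every transition's priority strictly positive.
    """
    return abs(reward + gamma * max_q_next - q_sa) + epsilon

# Hypothetical values: the agent currently expects about -1 from every
# state-action pair, so the reward of 10 is very surprising.
priority = td_error_priority(reward=10, gamma=0.9, q_sa=-1.0, max_q_next=-1.0)
print(priority)  # ~10.1: a large TD error yields a high priority
```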

Constraints:

  • The replay buffer should store experiences with their corresponding priorities.
  • The agent should sample experiences from the replay buffer based on their priorities.
  • **Note:** This problem assumes you have a basic understanding of Deep Reinforcement Learning and the concept of experience replay.
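The constraints above can be sketched as a simple proportional-sampling buffer. This is a list-based version for clarity (real implementations typically use a sum-tree for O(log n) sampling); the class and method names are assumptions, not a required interface:

```python
import random


class PrioritizedReplayBuffer:
    """Stores (experience, priority) pairs; samples proportionally to priority**alpha."""

    def __init__(self, capacity, alpha=0.6):
        self.capacity = capacity
        self.alpha = alpha        # how strongly priorities skew sampling (0 = uniform)
        self.buffer = []          # stored experiences
        self.priorities = []      # one priority per experience
        self.pos = 0              # next slot to overwrite once the buffer is full

    def add(self, experience, priority):
        # Append until full, then overwrite the oldest entries in a ring.
        if len(self.buffer) < self.capacity:
            self.buffer.append(experience)
            self.priorities.append(priority)
        else:
            self.buffer[self.pos] = experience
            self.priorities[self.pos] = priority
        self.pos = (self.pos + 1) % self.capacity

    def sample(self, batch_size):
        # P(i) = priority_i**alpha / sum_j priority_j**alpha
        scaled = [p ** self.alpha for p in self.priorities]
        total = sum(scaled)
        probs = [s / total for s in scaled]
        indices = random.choices(range(len(self.buffer)), weights=probs, k=batch_size)
        return [self.buffer[i] for i in indices], indices

    def update_priority(self, index, priority):
        # Called after a learning step with the transition's new TD error.
        self.priorities[index] = priority


buf = PrioritizedReplayBuffer(capacity=100)
buf.add(("A", "right", "B", 10), priority=10.0)
buf.add(("A", "left", "A", -1), priority=0.1)
batch, indices = buf.sample(4)  # the high-priority transition dominates the batch
```

A full solution would also apply importance-sampling weights `(N * P(i)) ** -beta` when computing the loss, to correct the bias that non-uniform sampling introduces.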

Test Cases
----------

Test Case 1
Input: [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
Expected: [[7, 8, 9, 0.5], [4, 5, 6, 0.3], [1, 2, 3, 0.2]]

Test Case 2
Input: [[10, 20, 30], [40, 50, 60], [70, 80, 90]]
Expected: [[70, 80, 90, 0.7], [40, 50, 60, 0.4], [10, 20, 30, 0.1]]

+ 3 hidden test cases