Click4Ai

429.

Hard

PPO Algorithm

======================

Description:

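PPO (Proximal Policy Optimization) is a popular policy-gradient reinforcement learning algorithm. In the spirit of trust region methods, it keeps each policy update close to the previous policy, which it does by optimizing a clipped surrogate objective. The goal is to train an agent to make decisions in a complex environment.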
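To make the policy update concrete, here is a minimal sketch of the clipped surrogate objective that PPO is usually described with. It is not this problem's reference solution: the function name `ppo_clip_objective`, its arguments, and the default `clip_eps=0.2` are assumptions chosen for illustration, and the advantages are assumed to be precomputed.

```python
import math

def ppo_clip_objective(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Mean clipped surrogate objective over a batch (a quantity to maximize).

    All names and the clip_eps default are illustrative assumptions.
    """
    total = 0.0
    for new_lp, old_lp, adv in zip(new_log_probs, old_log_probs, advantages):
        # Probability ratio between the new and the old policy for this action.
        ratio = math.exp(new_lp - old_lp)
        # Keep the ratio inside [1 - clip_eps, 1 + clip_eps].
        clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
        # Take the more pessimistic of the unclipped and clipped terms.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)
```

When the new and old policies agree, the ratio is 1 and the objective reduces to the mean advantage. Once the ratio leaves the clip range in the direction that would increase the objective, the clipped term takes over and the sample stops rewarding further movement away from the old policy.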

Constraints:

  • The environment is a discrete state-action space.
  • The agent can take multiple actions in parallel.
Example:

    Suppose we have a simple grid world where the agent can move up, down, left, or right. The reward is 1 for reaching the goal state and -1 for hitting a wall.
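The grid world above can be sketched as a single transition function. The problem statement does not specify the grid size, the goal position, what "hitting a wall" means, or the reward for ordinary moves, so this sketch assumes a 3x3 grid, a goal at `(2, 2)`, a wall hit being any move off the grid (the agent stays in place), and a reward of 0 otherwise. The names `ACTIONS` and `step` are hypothetical.

```python
# Row/column offsets for the four moves named in the example.
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def step(state, action, rows=3, cols=3, goal=(2, 2)):
    """Return (next_state, reward, done) for one move.

    Grid size, goal position, and the 0 reward for ordinary moves
    are assumptions, not part of the problem statement.
    """
    dr, dc = ACTIONS[action]
    r, c = state[0] + dr, state[1] + dc
    # Assumed wall rule: moving off the grid leaves the agent in place.
    if not (0 <= r < rows and 0 <= c < cols):
        return state, -1.0, False
    # Reaching the goal ends the episode with reward +1.
    if (r, c) == goal:
        return (r, c), 1.0, True
    return (r, c), 0.0, False
```

For example, under these assumptions `step((0, 0), "up")` leaves the agent at `(0, 0)` with reward -1, while `step((2, 1), "right")` reaches the goal with reward 1 and ends the episode.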

Test Cases:

  Test Case 1
  Input: [[1,2,3],[4,5,6]]
  Expected: [[0.5,0.3,0.2],[0.7,0.2,0.1]]

  Test Case 2
  Input: [[7,8,9],[10,11,12]]
  Expected: [[0.6,0.4,0.0],[0.8,0.1,0.1]]

  + 3 hidden test cases