Reinforcement Learning with Gym: A Beginner’s Guide to CartPole

Reinforcement learning (RL) has gained significant attention in the field of machine learning due to its ability to train agents to make optimal decisions through interaction with an environment. The CartPole environment, a classic benchmark for RL, is often used to demonstrate the fundamental concepts of this field. In this article, we will walk through the process of setting up and interacting with the CartPole environment using OpenAI’s Gym library.

What is CartPole?

The CartPole environment involves balancing a pole on a moving cart. The agent must control the cart to prevent the pole from falling over. The environment is commonly considered solved when the agent achieves an average reward of at least 475 over 100 consecutive episodes (the threshold registered for CartPole-v1). The CartPole environment is a great way to introduce yourself to reinforcement learning concepts because of its simplicity and availability in popular libraries like Gym.

Setting up the Environment

To get started, you will need to install the necessary packages, including gym, stable-baselines3, numpy, and matplotlib. Let’s break down the process:

# Create a virtual environment
python -m venv venv

# Activate the virtual environment
# On Windows:
./venv/Scripts/activate

# On Linux or Mac:
source venv/bin/activate

# Upgrade pip
python -m pip install --upgrade pip

# Install the required packages
pip install gym==0.26.2 stable-baselines3 numpy==1.26.4 matplotlib
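
If you want to confirm the installation before moving on, a quick sanity check from Python (just a sketch; it only imports each package and prints its version) looks like this:

import gym
import stable_baselines3
import numpy
import matplotlib

# Print the installed versions to confirm everything imported correctly
print("gym:", gym.__version__)
print("stable-baselines3:", stable_baselines3.__version__)
print("numpy:", numpy.__version__)
print("matplotlib:", matplotlib.__version__)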

Interacting with the CartPole Environment

Once the environment is set up, you can begin interacting with it. Below is a simple script that demonstrates how to interact with the CartPole environment using random actions:

import gym

# Step 1: Create the CartPole environment with render_mode
env = gym.make('CartPole-v1', render_mode='human')

# Step 2: Reset the environment to the initial state
observation, info = env.reset()  # reset() returns the initial observation and an info dict in Gym 0.26+

# Step 3: Interact with the environment
done = False
total_reward = 0

# Run one episode where the agent takes random actions
while not done:
    env.render()  # This visualizes the environment

    # Step 4: Randomly select an action from the action space
    action = env.action_space.sample()

    # Step 5: Take the action and observe the new state, reward, terminated, truncated, and info
    observation, reward, terminated, truncated, info = env.step(action)

    # The episode ends when either terminated or truncated is True
    done = terminated or truncated

    # Accumulate the total reward
    total_reward += reward

    # Optional: Print the observation and reward at each step
    print(f"Observation: {observation}, Reward: {reward}")

# After the episode is done, close the environment
env.close()

# Step 6: Output the total reward for the episode
print(f"Total reward for the episode: {total_reward}")

Explanation of the Code

  1. Creating the Environment:
    • We use gym.make() to create the CartPole environment with the identifier 'CartPole-v1'. The render_mode='human' argument allows us to visualize the environment as the agent interacts with it.
  2. Resetting the Environment:
    • The environment is reset to its initial state using env.reset(), which returns the initial observation (a representation of the environment’s state) along with an info dictionary in Gym 0.26 and later.
  3. Interacting with the Environment:
    • The while not done loop ensures the agent keeps interacting with the environment until the episode ends. The env.render() method renders the current state, allowing us to visually track the CartPole.
  4. Random Actions:
    • The agent takes random actions using env.action_space.sample(). While this is far from optimal behavior, it helps to understand how actions affect the environment; the sketch after this list inspects the action and observation spaces that these samples come from.
  5. Stepping Through the Environment:
    • env.step(action) advances the environment by one time step using the chosen action. This function returns the next observation, the reward for the action, two end-of-episode flags, and an info dictionary. The terminated flag indicates the episode ended naturally (e.g., the pole fell), while truncated indicates the maximum number of steps was reached; the loop exits when either is True.
  6. Total Reward:
    • Throughout the episode, the rewards are accumulated in total_reward, giving a performance measure of the agent. Once the episode is complete, the total reward is printed.
  7. Closing the Environment:
    • After the episode finishes, we call env.close() to close the rendering window and properly clean up resources.
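
If you are curious what env.action_space.sample() is drawing from, a short sketch like the following inspects both spaces directly (the exact bounds printed come from your installed Gym version):

import gym

env = gym.make('CartPole-v1')

# CartPole's action space is Discrete(2): action 0 pushes the cart left, 1 pushes it right
print("Action space:", env.action_space)

# The observation space is a Box of four floats:
# cart position, cart velocity, pole angle, pole angular velocity
print("Observation space:", env.observation_space)
print("Lower bounds:", env.observation_space.low)
print("Upper bounds:", env.observation_space.high)

env.close()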

Output

Observation: [-0.04302248 -0.14830549  0.0392436   0.2897281 ], Reward: 1.0
Observation: [-0.04598859  0.04623553  0.04503816  0.00967591], Reward: 1.0
Observation: [-0.04506388  0.2406836   0.04523169 -0.26846367], Reward: 1.0
Observation: [-0.04025021  0.04494631  0.03986241  0.03813554], Reward: 1.0
Observation: [-0.03935128  0.23947462  0.04062512 -0.2417087 ], Reward: 1.0
Observation: [-0.03456179  0.43399343  0.03579095 -0.5213057 ], Reward: 1.0
Observation: [-0.02588192  0.6285938   0.02536483 -0.80249894], Reward: 1.0
Observation: [-0.01331005  0.8233589   0.00931485 -1.0870962 ], Reward: 1.0
Observation: [ 0.00315713  1.0183568  -0.01242707 -1.3768418 ], Reward: 1.0
Observation: [ 0.02352427  1.2136317  -0.0399639  -1.673385  ], Reward: 1.0
Observation: [ 0.0477969  1.0189959 -0.0734316 -1.3934107], Reward: 1.0
Observation: [ 0.06817682  1.214951   -0.10129982 -1.7081208 ], Reward: 1.0
Observation: [ 0.09247584  1.4110812  -0.13546224 -2.030539  ], Reward: 1.0
Observation: [ 0.12069746  1.2175933  -0.17607301 -1.7826703 ], Reward: 1.0
Observation: [ 0.14504933  1.414204   -0.21172643 -2.124525  ], Reward: 1.0
Total reward for the episode: 15.0

In the CartPole environment, the agent receives an observation (the state) and a reward at every timestep. Here’s what each part of the output means:

1. Observation:

The observation is an array of four numbers representing the state of the CartPole system at a given moment. These numbers describe the following (a short sketch after this list shows how to unpack them into named variables):

  • Cart Position (observation[0]): The position of the cart on the track.
    • Negative values mean the cart is to the left of the center, positive values mean it’s to the right.
  • Cart Velocity (observation[1]): The velocity of the cart (how fast it is moving).
    • Negative values indicate the cart is moving left, and positive values indicate it is moving right.
  • Pole Angle (observation[2]): The angle of the pole relative to vertical.
    • Negative values mean the pole is tilting to the left, and positive values mean it’s tilting to the right.
  • Pole Angular Velocity (observation[3]): The angular velocity of the pole (how fast and in which direction it is rotating).
    • Negative values indicate the pole is rotating toward the left, and positive values indicate it is rotating toward the right.
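
To make printouts like the ones above easier to read, you can unpack the observation into named variables. This is only an illustrative sketch; the variable names are not part of the Gym API:

import gym

env = gym.make('CartPole-v1')
observation, info = env.reset()

# Unpack the four state variables in the order Gym reports them
cart_position, cart_velocity, pole_angle, pole_angular_velocity = observation

print(f"Cart position:         {cart_position:+.4f}")
print(f"Cart velocity:         {cart_velocity:+.4f}")
print(f"Pole angle (rad):      {pole_angle:+.4f}")
print(f"Pole angular velocity: {pole_angular_velocity:+.4f}")

env.close()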

2. Reward:

The reward is 1.0 for every timestep that the pole remains upright (i.e., the episode continues). In CartPole, the goal is to maximize this reward by keeping the pole balanced as long as possible. The agent receives a reward of 1 for every step it survives, and the episode ends when any of the following occurs (a sketch after this list shows how to check the first two conditions yourself):

  • The pole tilts more than about 12 degrees from vertical.
  • The cart moves more than 2.4 units away from the center of the track.
  • The maximum number of steps is reached (500 for CartPole-v1), since this is an episodic task.
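
If you want to verify these conditions from an observation yourself, a small sketch like the one below works. The helper name would_terminate and the constants are just illustrative, using the thresholds documented for CartPole-v1:

import math

# Approximate CartPole-v1 termination thresholds, per the environment's documentation
POLE_ANGLE_LIMIT = 12 * math.pi / 180   # about 0.2095 radians
CART_POSITION_LIMIT = 2.4               # distance from the center of the track

def would_terminate(observation):
    """Return True if this observation lies outside CartPole's allowed bounds."""
    cart_position, _, pole_angle, _ = observation
    return abs(cart_position) > CART_POSITION_LIMIT or abs(pole_angle) > POLE_ANGLE_LIMIT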

3. Total Reward for the Episode:

In this case, the total reward is 15.0, meaning the agent managed to balance the pole for 15 timesteps before the episode ended (either because the pole fell or another termination condition was met).

The number of timesteps corresponds directly to the reward: since the agent gets a reward of 1.0 per timestep, the total reward for the episode is the number of timesteps for which the pole remained balanced.

Fifteen timesteps is a very short run, which is exactly what you would expect from a purely random policy: it rarely keeps the pole balanced for long.
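
A single random episode is also very noisy, so a more useful baseline is the average total reward over several random episodes. Here is a minimal sketch (the choice of 20 episodes is arbitrary):

import gym

env = gym.make('CartPole-v1')  # no render_mode, so this runs quickly without a window
episode_rewards = []

for episode in range(20):
    observation, info = env.reset()
    done = False
    total_reward = 0
    while not done:
        action = env.action_space.sample()  # random policy, as in the script above
        observation, reward, terminated, truncated, info = env.step(action)
        done = terminated or truncated
        total_reward += reward
    episode_rewards.append(total_reward)

env.close()
print(f"Average reward over {len(episode_rewards)} random episodes: "
      f"{sum(episode_rewards) / len(episode_rewards):.1f}")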

Conclusion

In this tutorial, we demonstrated how to set up and interact with the CartPole environment using OpenAI’s Gym library. While we only used random actions, the same environment can be used with more sophisticated RL algorithms, such as those provided by the stable-baselines3 library, to train agents that perform significantly better. By understanding the basic structure of interacting with environments like CartPole, you can take your first steps into the exciting field of reinforcement learning!
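
As a pointer toward that next step, here is a minimal sketch of training a PPO agent from stable-baselines3 on CartPole. The timestep budget and evaluation episode count are only illustrative, and depending on your stable-baselines3 version it may build the environment with Gymnasium internally rather than the Gym package pinned earlier:

from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy

# Let stable-baselines3 create the environment from its ID and train a PPO agent
# (10,000 timesteps is only a quick demonstration, not a tuned budget)
model = PPO('MlpPolicy', 'CartPole-v1', verbose=0)
model.learn(total_timesteps=10_000)

# Evaluate the trained policy: mean and standard deviation over 10 episodes
mean_reward, std_reward = evaluate_policy(model, model.get_env(), n_eval_episodes=10)
print(f"Mean reward after training: {mean_reward:.1f} +/- {std_reward:.1f}")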
