A simple collection of policy gradient algorithm implementations in PyTorch. This repository is designed for anyone looking to get hands-on experience with basic RL algorithms.
Experiments are conducted on the following environments:

- CartPole-v1 (classic control)
- PongNoFrameskip-v4 (Atari, pixel observations)
If you'd like to try other Gym environments, you'll need to define the corresponding hyperparameters in hparams/ and adjust the scripts to handle the new observation and action spaces, as well as the network architecture.
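As a rough sketch of what that adaptation involves (assuming a Gymnasium-style API; the environment and the network below are placeholders, not the ones used in this repo), the new observation and action spaces can be queried from the environment and used to size the policy network:

```python
import gymnasium as gym
import torch.nn as nn

env = gym.make("Acrobot-v1")  # hypothetical new environment

obs_dim = int(env.observation_space.shape[0])  # size of the flat observation vector
n_actions = int(env.action_space.n)            # number of discrete actions

# Placeholder MLP policy sized from the environment's spaces.
policy = nn.Sequential(
    nn.Linear(obs_dim, 128),
    nn.Tanh(),
    nn.Linear(128, n_actions),
)
```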
The code is developed and tested with Python 3.11 and CUDA 12.1. Make sure you have them installed, then follow the steps below to set up.
git clone https://github.com/keishihara/policy-gradients-pytorch.git
cd policy-gradients-pytorch
# Install dependencies via `uv`
uv sync
# Or install via `pip`
pip install -e .

This repo currently contains the following classic policy gradient algorithms. All hyperparameters are stored in hparams/, and logs are automatically saved to logs/.
The simplest policy gradient algorithm, which optimizes the policy by following the gradient of the expected cumulative reward.
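To make that objective concrete, here is a minimal sketch of the REINFORCE loss in PyTorch; the function and variable names are illustrative and not taken from this repo's scripts.

```python
import torch

def reinforce_loss(log_probs: torch.Tensor, rewards: list[float], gamma: float = 0.99) -> torch.Tensor:
    """log_probs: log pi(a_t | s_t) for one episode; rewards: r_t for the same episode."""
    returns, g = [], 0.0
    for r in reversed(rewards):  # discounted return-to-go G_t
        g = r + gamma * g
        returns.append(g)
    returns = torch.tensor(list(reversed(returns)))
    # Gradient ascent on E[G_t * log pi(a_t | s_t)] == descent on the negative.
    return -(log_probs * returns).sum()
```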
# CartPole-v1
python algos/reinforce/reinforce.py --cuda

An upgraded version of REINFORCE that incorporates a baseline to reduce variance in the gradient estimate.
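As a sketch of the baseline idea (the concrete baseline used in vpg_on_cc.py may differ), subtracting a batch statistic from the returns leaves the gradient unbiased while shrinking its variance:

```python
import torch

def vpg_loss(log_probs: torch.Tensor, returns: torch.Tensor) -> torch.Tensor:
    # Using the batch-mean return as a simple baseline; the repo's script
    # may use a different (e.g. learned or running-average) baseline.
    advantages = returns - returns.mean()
    return -(log_probs * advantages).sum()
```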
# Train a policy on CartPole-v1
python algos/vpg/vpg_on_cc.py --cuda

A training script for PongNoFrameskip-v4 is also provided, although it may not converge.
# Train a policy on PongNoFrameskip-v4
python algos/vpg/vpg_on_atari.py --cuda

A more advanced algorithm that introduces a critic network to estimate the value function (state-dependent baseline), enabling it to solve Atari games with pixel observations.
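The rough shape of an A2C update step looks like the sketch below; the tensor names and coefficients are illustrative, not the exact ones in a2c.py.

```python
import torch
import torch.nn.functional as F

def a2c_loss(log_probs, values, returns, entropy, value_coef=0.5, entropy_coef=0.01):
    advantages = returns - values.detach()          # critic V(s) as a state-dependent baseline
    policy_loss = -(log_probs * advantages).mean()  # actor term
    value_loss = F.mse_loss(values, returns)        # critic regression toward the returns
    # Entropy bonus encourages exploration.
    return policy_loss + value_coef * value_loss - entropy_coef * entropy.mean()
```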
# Train a policy on PongNoFrameskip-v4
python algos/a2c/a2c.py --cuda

Track training dynamics and performance metrics using TensorBoard:
tensorboard --logdir logs
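For reference, this is a minimal sketch of how scalar metrics can be written under logs/ with torch.utils.tensorboard; the run directory and tag names here are hypothetical, not this repo's actual layout.

```python
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir="logs/example_run")  # hypothetical run directory
for step, episode_return in enumerate([10.0, 25.0, 60.0]):
    writer.add_scalar("charts/episode_return", episode_return, step)
writer.close()
```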