Deep Q Learning for Top-Down 2D Games
Train a deep Q-network to master the classic game of Snake. Watch as the computer learns in real time, evolving from novice to expert through continuous gameplay and self-improvement.
The rules of the game are simple.
- The snake's length represents the player's score.
- Each time the snake eats an apple, it grows by one chunk.
- The game ends if the snake head hits a wall or its own body.
- Deep Q Learning for Top-Down 2D Games
- 1. General Overview
- 2. Setup
- 3. Core Files
- 4. Q-Learning Overview
- 5. Feature Roadmap
- 6. References
- Reinforcement learning is a style of machine learning that relies on past experience to develop a policy for what to do next.
- Deep neural networks are used in machine learning to set up decision pathways that maximize reward (i.e. minimize loss or error).
- Q-learning is a subclass of reinforcement learning that is model-free, meaning that it does not require a model of the environment in which it operates and can learn directly from raw experiences.
- In this exploration, a policy is a strategy to win the game of snake, and we want to find the optimal one.
- The agent executes the policy.
- The policy begins as "move randomly and hope for the best".
- As the agent gains experience, the policy becomes less random and moves closer to an optimal solution.
The more the agent plays, the more its random movements are replaced by policy-predicted movements.
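In practice, that trade-off is usually implemented with an epsilon-greedy rule along the lines of the sketch below (illustrative Python, not the project's exact implementation; `q_values_for` stands in for whatever the trained network predicts).

```python
import random

def choose_action(state, q_values_for, n_actions, epsilon):
    """Epsilon-greedy: explore randomly with probability epsilon, otherwise exploit the policy."""
    if random.random() < epsilon:
        return random.randrange(n_actions)  # random exploration
    q_values = q_values_for(state)          # policy prediction (e.g. the DQN's output)
    return max(range(n_actions), key=lambda a: q_values[a])  # greedy, policy-predicted move
```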
If you are fairly new to Python programming, I'd recommend the following steps:
1. Download and install VS Code.
2. Install Python 3.12.4 (☑️ Add python.exe to PATH if you have no other Python versions installed).
3. Install Git Bash.
4. Open VS Code.
5. Press `F1`, and in the command palette, search for `Terminal: Select Default Profile` and set Git Bash as the default terminal.
6. Start a new terminal with `Ctrl` + `` ` ``.
7. Clone this repository to a directory where you like to store your coding projects.
8. Open this repository (i.e. the `qsnake` folder) as the current workspace folder with `Ctrl`+`K` `Ctrl`+`O`.
9. Make sure the terminal path points to the `qsnake` folder, and if it doesn't, navigate there via `cd <path_to_qsnake_folder>`. You can confirm you're in the right spot with a quick `ls -la` command.
10. From the terminal, run `pip install virtualenv` to install the `virtualenv` module.
11. Run `python -m virtualenv <myenvname> --python=python3.12.4` to create a virtual environment that runs on Python 3.12.4.
12. Activate the virtual environment with `source <myenvname>/Scripts/activate`.
13. You should see `(<myenvname>)` two lines above the terminal input line when the environment is active.
14. Press `F1` to open VS Code's command palette, then search for `Python: Select Interpreter` and select `Python 3.12.4 64-bit ('<myenvname>': venv)`.
15. Run `pip install -r requirements.txt` to install all dependencies in your activated virtual environment.
16. Next, ensure the `"human"` option in `configs/config.json` is set to `true`.
17. Once everything is installed and configured, run `python src/q_snake.py` to test if you can play the game manually.
18. Next, specify `"human": false` in `config.json`, then run `python src/q_snake.py` again, this time to see if the agent is able to play the game.
19. Let the agent run to the end and check that `plotting.py` is able to produce a graph of the learning curve.
20. Play with a few settings in `config.json` and re-run `python src/q_snake.py` to see how the changes affect the agent's behavior. Feel free to do this until you get bored or it sparks a question you want to explore.
21. OPTIONAL - If you want to utilize the GIF-making supporting scripts, install Ghostscript and note the path of the binary. You will need to change a line in the preamble of `gif_creator.py` to specify where the binary is located.
22. OPTIONAL - If you were bold enough to install Ghostscript, try saving game frames as EPS files. You can then run `python src/make_gif_from_images.py` to convert the EPS files into PNG files and then into an animated GIF.
23. OPTIONAL - Try converting the saved EPS files into PNG and then into a GIF all at once by specifying both `"save_for_gif": true` and `"make_gif": true` in `config.json`. Please note that this process can take 30 to 60 minutes for 50 training episodes.
Run `python src/q_snake.py` to interface with this project.
Define how the script should run. All of the keys in the default configuration below are required, and the values shown illustrate the expected data types.
```json
{
  "human": true,
  "name": "default",
  "save_for_gif": false,
  "make_gif": false,
  "params": {
    "epsilon": 1.0,
    "gamma": 0.95,
    "batch_size": 256,
    "epsilon_min": 0.001,
    "epsilon_decay": 0.98,
    "learning_rate": 0.00025,
    "layer_sizes": [128, 128, 128],
    "num_episodes": 20,
    "max_steps": 3000,
    "state_definition_type": "default"
  }
}
```
- `human`: Whether to run so a human can play with the arrow keys, or to let the agent play by itself.
- `name`: The name of the configuration.
- `save_for_gif`: Whether to save EPS files of the agent as it trains in preparation for making a training montage GIF.
- `make_gif`: Whether to make a training montage GIF at the end of the run.
- `epsilon`: Initial ratio of random versus policy-predicted steps.
- `gamma`: Discount factor for future rewards (0 is short-sighted, 1 is long-sighted).
- `batch_size`: The number of training samples processed before the model's internal parameters are updated during one iteration of training.
- `epsilon_min`: The minimum ratio of time steps we'd like the agent to move randomly versus in a policy-predicted direction.
- `epsilon_decay`: How much randomness we carry into the next iteration of gathering a batch of states. For example, if we move randomly 50% of the time for the current batch and `epsilon_decay` is set to 0.95, we will move randomly 47.5% of the time on the next batch (see the sketch after this list).
- `learning_rate`: To what extent newly acquired info overrides old info (0: learn nothing and exploit prior knowledge exclusively; 1: only consider the most recent information).
- `layer_sizes`: The number of nodes in each hidden layer of our fully-connected Q-network.
- `num_episodes`: The number of games to play.
- `max_steps`: The maximum number of steps allowed in a single game.
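To see how `epsilon`, `epsilon_decay`, and `epsilon_min` interact, the short sketch below (illustrative only, not project code) prints the exploration rate the agent would use over the run described by the default configuration, treating each episode as one decay step for simplicity.

```python
import json

# Load the default configuration shipped with the repo (configs/config.json, shown above).
with open("configs/config.json") as f:
    params = json.load(f)["params"]

epsilon = params["epsilon"]  # start fully random (1.0 by default)
for episode in range(params["num_episodes"]):
    print(f"episode {episode:2d}: explore {epsilon:.1%} of the time")
    # Decay toward epsilon_min after each decay step.
    epsilon = max(params["epsilon_min"], epsilon * params["epsilon_decay"])
```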
- The number of training examples used in the estimate of the error gradient is a hyperparameter for the learning algorithm called the "batch size" (or simply "the batch").
- A hyperparameter is a variable set before learning begins.
- Note that the error gradient is a statistical estimate. The more training examples used in the estimate:
  - The more accurate the estimate will be.
  - The more likely the network will adjust such that the model's overall performance improves.
- An improved estimate of the error gradient comes at a cost: it requires the model to make many more predictions before the estimate can be calculated and the weights updated.
Optimization algorithms that use the entire training set are called batch or deterministic gradient methods, because they process all of the training examples simultaneously in a large batch.
Optimization algorithms that use only a single example at a time are sometimes called stochastic or sometimes online methods. The term online is usually reserved for the case where the examples are drawn from a stream of continually created examples rather than from a fixed-size training set over which several passes are made.
A batch size of 32 means that 32 samples from the training dataset will be used to estimate the error gradient before the model weights are updated. One training epoch means that the learning algorithm has made one pass through the training dataset, where examples were separated into randomly selected "batch size" groups.
Historically, a training algorithm where the batch size is set to the total number of training examples is called "batch gradient descent" and a training algorithm where the batch size is set to 1 training example is called "stochastic gradient descent" or "online gradient descent".
Put another way, the batch size defines the number of samples that must be propagated through the network before the weights can be updated.
| Gradient Descent Type | Batch Size |
|---|---|
| Batch Gradient Descent | all training samples |
| Stochastic Gradient Descent | 1 |
| Minibatch Gradient Descent | 1 < batch size < all training samples |
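As a concrete (toy) illustration of these regimes, the sketch below fits a line with gradient descent on synthetic data; switching `batch_size` between `len(X)`, `1`, and `32` reproduces the three rows of the table. It is not code from this repository.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = 3x + 2 plus a little noise.
X = rng.uniform(-1, 1, size=(1000, 1))
y = 3 * X[:, 0] + 2 + 0.1 * rng.standard_normal(1000)

def fit(batch_size, lr=0.1, epochs=100):
    """Gradient descent on (w, b); batch_size selects the regime from the table above."""
    w, b = 0.0, 0.0
    n = len(X)
    for _ in range(epochs):
        idx = rng.permutation(n)  # shuffle, then split into batch_size groups
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            err = (w * X[batch, 0] + b) - y[batch]
            # Error-gradient estimate computed from this batch only.
            w -= lr * (2 * err * X[batch, 0]).mean()
            b -= lr * (2 * err).mean()
    return w, b

print(fit(batch_size=len(X)))  # batch gradient descent
print(fit(batch_size=1))       # stochastic (online) gradient descent
print(fit(batch_size=32))      # minibatch gradient descent
```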
The modules required for running the project. After creating a virtual environment with Python 3.12.4, run `pip install -r requirements.txt` to install all necessary dependencies.
This subclass of `gymnasium.Env` represents the Snake game environment (which includes the snake itself).
Train an agent to play Snake via a deep Q-learning network.
Graph some statistics about how the Q-network was trained and the performance of the agent.
Convert saved EPS files into PNG files and then PNG files into an animated GIF. Using this class requires a separate installation of Ghostscript, and `EpsImagePlugin.gs_windows_binary = <path_to_binary>` must be appropriately edited within `gif_builder.py`.
Convert already-saved EPS or PNG image files into a GIF without having to re-run the game. Again, Ghostscript is required.
4.1 What is the Bellman Equation?
The Bellman equation is a fundamental recursive formula in reinforcement learning that expresses the value (i.e. Q-value) of a state in terms of the expected reward and the values of subsequent states. It has an abstract general definition, but in Q-learning, it's more easily understood when rearranged into the Q-value update rule.
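In that rearranged form, the update rule reads

$$Q(s_t,a_t) \leftarrow Q(s_t,a_t) + \alpha\left[r_t + \gamma\,\underset{a}{\max}\,Q(s_{t+1},a) - Q(s_t,a_t)\right]$$

where: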
- $Q(s_t,a_t)$ is the Q-value of taking action $a$ in state $s$ at time $t$.
- $\alpha$ is the learning rate, which determines the extent to which new experience overrides past experience.
- $r_t$ is the immediate reward received after taking action $a$ in state $s$ at time $t$.
- $\gamma$ is the discount factor, which balances the importance of immediate versus future rewards.
- $s_{t+1}$ is the next state resulting from taking action $a$ in state $s$ at time $t$.
- $\underset{a}{\max}\,Q(s_{t+1},a)$ is the maximum Q-value for the next state $s_{t+1}$ over all possible actions $a$.
An episode of the algorithm ends when the state $s_{t+1}$ is a final (terminal) state, i.e. the `done` state in `agent.py`. For all final states $s_{final}$, $Q(s_{final},a)$ is never updated; it is simply the reward observed for reaching that state.
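A literal, tabular reading of this update (for illustration only; the project approximates Q with a neural network rather than a table, and the `alpha`/`gamma` defaults below are arbitrary) looks like this in Python:

```python
from collections import defaultdict

# Q-table: maps (state, action) pairs to estimated Q-values, defaulting to 0.
Q = defaultdict(float)

def q_update(state, action, reward, next_state, done, actions, alpha=0.1, gamma=0.95):
    """Apply one Bellman update for a single observed transition."""
    if done:
        target = reward  # terminal state: no future rewards to bootstrap from
    else:
        target = reward + gamma * max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] += alpha * (target - Q[(state, action)])
```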
- $\alpha$ determines to what extent newly acquired information overrides old information, with $\alpha\in(0,1]$.
  - As $\alpha$ approaches 0, the agent relies more exclusively on prior knowledge.
  - As $\alpha$ approaches 1, the agent relies more exclusively on the most recent data.
  - In deterministic environments, a value of 1 is optimal.
  - In environments with some randomness, $\alpha$ needs to decrease to zero over time for the Q-values to converge to an optimal strategy.
    - In practice, however, using a constant learning rate works just fine.
- $\gamma$ determines the importance of future rewards, with $\gamma\in(0,1]$ (a small numerical illustration follows this list).
  - As $\gamma$ approaches 0, the agent prioritizes short-term rewards.
  - As $\gamma$ approaches 1, the agent acts more exclusively to achieve long-term rewards.
  - Starting with a small discount factor and increasing it toward a final value over time tends to accelerate learning.
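For intuition about $\gamma$, here is a toy calculation (not project code): with the default $\gamma = 0.95$ from `config.json`, a reward that arrives $n$ steps in the future is worth only $0.95^n$ of its face value today.

```python
gamma = 0.95  # default discount factor from config.json

# Present value of a reward of 1 received n steps in the future.
for n in (1, 5, 10, 20, 50):
    print(f"{n:2d} steps ahead -> discounted value {gamma ** n:.3f}")
```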
In the Snake game, the agent (the brain controlling the snake) navigates the game grid, eats food to grow, and avoids collisions with itself and the walls.
- State $s$
  - A configuration of the game (i.e. the position of the snake, the walls, and the food).
- Action $a$
  - Possible moves (i.e. up, down, left, or right).
- Reward $r$
  - The reward received for taking action $a$ in state $s$ (a minimal example is sketched after this list).

    | Reward Type | Description |
    |---|---|
    | $+$ | Reward for eating food. |
    | $-$ | Penalty for colliding with walls or itself. |
    | $0$ | No reward for moving without eating food. |
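A minimal reward function consistent with the table above might look like the sketch below; the magnitudes (+10, -100) are illustrative assumptions, not the values the project actually uses.

```python
def compute_reward(ate_food: bool, crashed: bool) -> float:
    """Map the outcome of one step to a scalar reward (illustrative values only)."""
    if crashed:
        return -100.0  # penalty for hitting a wall or the snake's own body
    if ate_food:
        return 10.0    # reward for eating an apple
    return 0.0         # neutral step: moved without eating
```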
1. Initialize DQN
   - Start with an untrained, fully-connected neural network where all Q-values are effectively zero.
2. Observe Current State
   - Record the current position of the snake and the apple on the grid.
3. Choose Action
   - Choose to act based more on past experience or on random exploration, governed by $\epsilon$, a hyperparameter set in `config.json` and reduced over time via `epsilon_decay`.
4. Execute Action and Observe Reward
   - Take the action $a$, move the snake, then receive the immediate reward $r$.
5. Observe Next State
   - Record the new position of the snake and the apple on the grid.
6. Update Q-Value
   - Apply the Bellman equation to update the Q-value for the state-action pair $(s, a)$.
7. Iterate
   - Repeat the process, updating the weights in the DQN as the agent continues to play. A compact end-to-end sketch of this loop is given below.
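The loop below ties these seven steps together. It is a sketch for illustration, written with PyTorch against a gymnasium-style environment; the project's actual implementation (e.g. in `agent.py` and `q_snake.py`) may differ in framework and details, and `env`, `state_dim`, and `n_actions` are placeholders.

```python
import random
from collections import deque

import numpy as np
import torch
import torch.nn as nn

def build_qnet(state_dim, n_actions, layer_sizes=(128, 128, 128)):
    """Fully-connected Q-network mapping a state vector to one Q-value per action."""
    layers, in_dim = [], state_dim
    for width in layer_sizes:
        layers += [nn.Linear(in_dim, width), nn.ReLU()]
        in_dim = width
    layers.append(nn.Linear(in_dim, n_actions))
    return nn.Sequential(*layers)

def train(env, state_dim, n_actions, params):
    qnet = build_qnet(state_dim, n_actions, params["layer_sizes"])
    opt = torch.optim.Adam(qnet.parameters(), lr=params["learning_rate"])
    memory = deque(maxlen=100_000)  # replay buffer of (s, a, r, s', done) transitions
    epsilon = params["epsilon"]

    for episode in range(params["num_episodes"]):
        state, _ = env.reset()
        for _ in range(params["max_steps"]):
            # 3. Choose Action (epsilon-greedy).
            if random.random() < epsilon:
                action = random.randrange(n_actions)
            else:
                with torch.no_grad():
                    action = qnet(torch.as_tensor(state, dtype=torch.float32)).argmax().item()

            # 4./5. Execute Action, Observe Reward and Next State.
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            memory.append((state, action, reward, next_state, done))
            state = next_state

            # 6. Update Q-Values on a random minibatch using Bellman targets.
            if len(memory) >= params["batch_size"]:
                batch = random.sample(memory, params["batch_size"])
                s, a, r, s2, d = zip(*batch)
                s = torch.as_tensor(np.asarray(s), dtype=torch.float32)
                s2 = torch.as_tensor(np.asarray(s2), dtype=torch.float32)
                a = torch.as_tensor(a)
                r = torch.as_tensor(r, dtype=torch.float32)
                d = torch.as_tensor(d, dtype=torch.float32)

                q_sa = qnet(s).gather(1, a.unsqueeze(1)).squeeze(1)
                with torch.no_grad():
                    target = r + params["gamma"] * qnet(s2).max(dim=1).values * (1 - d)
                loss = nn.functional.mse_loss(q_sa, target)
                opt.zero_grad()
                loss.backward()
                opt.step()

            if done:
                break

        # 7. Iterate: decay epsilon so random moves are gradually replaced by policy-predicted moves.
        epsilon = max(params["epsilon_min"], epsilon * params["epsilon_decay"])
```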
Here are some ideas for future development.
- Enable model saving functionality.
- Allow previously trained models to play with saved settings.
- Allow previously trained models to play and resume training.
Here are the key references I used to develop this project.