This repository contains the code for the experiments in the Master's thesis "State Space Discretization for Reinforcement Learning". The thesis aimed at creating models that transform the continuous, high-dimensional state spaces of reinforcement learning into discrete embeddings, enabling the use of classical (tabular) reinforcement learning algorithms while retaining the generalization abilities of deep neural networks. These models yield tabular policies, which are more interpretable to humans than policies produced by deep reinforcement learning with deep neural networks.
The models use overshooting LSTMs to produce continuous embeddings of the input state history and then convert them into discrete embeddings using several techniques. These discretization models are tested with classical reinforcement learning algorithms (online, offline, and model-based) on the highly complex and partially observable Industrial Benchmark environment. The best results obtained with these techniques are compared with the best results of the online deep reinforcement learning algorithm PPO:
| Approach | Algorithm | Reward ± STDEV | Train Episodes | Eval Episodes | Notes |
|---|---|---|---|---|---|
| Deep Reinforcement Learning | PPO (online) | -185.24 ± 0.88 | 10620 | 50 | |
| Discretization Models + Tabular Reinforcement Learning | Q-learning (online) | -181.54 ± 0.85 | 7000 | 50 | |
| | Q-learning (model-based) | -184.87 ± 2.15 | 7000 | 50 | The live environment is replaced with a model of the environment that predicts the next state; rewards are calculated manually. |
| | R-Min (offline) | -188.69 ± 0.64 | 10000 | 50 | Algorithm from the paper. |
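The actual model architecture is defined in the repository code; purely as an illustration of the idea described above (an LSTM embeds the state history and the continuous embedding is then mapped to a discrete code), a minimal sketch is shown below. It assumes PyTorch, and every name in it (`HistoryEncoder`, `discretize`, `n_bins`) is hypothetical rather than part of this repository; the overshooting losses and the thesis's actual discretization techniques are not shown.

```python
# Illustrative sketch only -- not the repository's actual model code.
# Assumes PyTorch; overshooting losses and the thesis's discretization
# techniques are omitted.
import torch
import torch.nn as nn


class HistoryEncoder(nn.Module):
    """Encodes a state history into a continuous embedding with an LSTM."""

    def __init__(self, state_dim: int, embed_dim: int):
        super().__init__()
        self.lstm = nn.LSTM(state_dim, embed_dim, batch_first=True)

    def forward(self, history: torch.Tensor) -> torch.Tensor:
        # history: (batch, time, state_dim) -> last hidden state as embedding
        _, (h_n, _) = self.lstm(history)
        return h_n[-1]                      # (batch, embed_dim)


def discretize(embedding: torch.Tensor, n_bins: int = 10) -> torch.Tensor:
    """Map each embedding dimension to one of n_bins equal-width bins,
    yielding an integer code usable as a key in a tabular policy."""
    squashed = torch.tanh(embedding)        # bound values to (-1, 1)
    bins = ((squashed + 1) / 2 * n_bins).long().clamp(max=n_bins - 1)
    return bins                             # (batch, embed_dim) integer code


# Example: a batch of 4 histories, 20 steps, 6-dimensional observations
encoder = HistoryEncoder(state_dim=6, embed_dim=8)
codes = discretize(encoder(torch.randn(4, 20, 6)))
print(codes.shape)                          # torch.Size([4, 8])
```

Integer codes of this kind can serve directly as discrete states (table keys) for the tabular algorithms used later in the pipeline.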
This README contains the following sections:
- Setup
- Perform Experiment
- Evaluation
- Visualization
The dependencies for the experiments can be installed with the following command:
`pip install -r requirements.txt`
The experiment consists of the following tasks:
- Generate data for discretization model training
- Train LSTM-based discretization models
- Use the discretization models to train online and model-based RL algorithms (a minimal Q-learning sketch follows this list)
- Generate data for offline RL training
- Train offline RL algorithms
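As a minimal, hypothetical sketch of how such discrete codes enable classical online Q-learning (the third task above), consider the loop below. The gym-style `env` returning `(obs, reward, done, info)` and the `encode` function standing in for a trained discretization model are assumptions for illustration, not the pipeline's actual interfaces.

```python
# Minimal tabular Q-learning sketch, assuming a gym-style env and a trained
# discretization model exposed as `encode(history) -> hashable code`.
# Illustrative only; the pipeline's actual implementation differs.
from collections import defaultdict
import random


def q_learning(env, encode, n_actions, episodes=7000,
               alpha=0.1, gamma=0.99, epsilon=0.1):
    q = defaultdict(lambda: [0.0] * n_actions)   # discrete code -> action values
    for _ in range(episodes):
        history = [env.reset()]
        done = False
        while not done:
            s = encode(history)
            # epsilon-greedy action selection over the tabular Q-values
            a = (random.randrange(n_actions) if random.random() < epsilon
                 else max(range(n_actions), key=lambda i: q[s][i]))
            obs, reward, done, _ = env.step(a)
            history.append(obs)
            s_next = encode(history)
            # standard one-step Q-learning update on the discrete codes
            q[s][a] += alpha * (reward + gamma * max(q[s_next]) - q[s][a])
    return q
```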
A pipeline automates the entire experiment; only the data generation for discretization model training is kept as a separate step, so that the same data can be reused across runs instead of being regenerated each time. The experiments can be run by executing the following notebooks:
- `generate_ib_lstm_dataset.ipynb` (run only once, after pulling the repo)
- `pipeline.ipynb`
Artifacts generated by each pipeline run, such as discretization models and policies, are saved in `tmp/experiments/`.
To evaluate the policies generated by the experiments, execute the following notebook:
`evaluation.ipynb`
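Conceptually, evaluation rolls each policy out for a fixed number of episodes (50 in the table above) and reports the mean and standard deviation of the episode return. A minimal sketch, again assuming a hypothetical gym-style environment, an `encode` function from a discretization model, and a greedy tabular policy:

```python
# Illustrative evaluation loop only -- not the notebook's actual code.
# Assumes a gym-style `env`, an `encode(history) -> hashable code` function
# from a trained discretization model, and a Q-table mapping codes to
# per-action value lists.
import statistics


def evaluate(env, encode, q_table, n_actions, episodes=50):
    returns = []
    for _ in range(episodes):
        history = [env.reset()]
        done, total = False, 0.0
        while not done:
            s = encode(history)
            # act greedily with respect to the learned tabular values
            a = max(range(n_actions), key=lambda i: q_table[s][i])
            obs, reward, done, _ = env.step(a)
            history.append(obs)
            total += reward
        returns.append(total)
    # mean episode return and sample standard deviation, as in the table above
    return statistics.mean(returns), statistics.stdev(returns)
```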
The results used in the thesis can be generated and visualized by running the notebook:
`generate_results.ipynb`