In this markdown, I add a few tips that complement Edward Choi's `README.md` for his medGAN repository. Congratulations to Edward on his excellent work! I would like to thank Antoine Hamon for his help.
The goal of this markdown is only to get medGAN running, not to obtain useful results: we minimize the computing time at the cost of unrealistic generated samples. My setup: Windows 10.
In this example, we use binary variables (and not count variables).
Edward Choi's `medgan` repository is composed of two programs that have since been updated for Python 3:
- `process_mimic.py` (124 lines) takes the public MIMIC-III dataset as input and outputs a training dataset suitable for `medgan.py`,
- `medgan.py` (410 lines) takes the output of `process_mimic.py` as input and outputs the generated (fake) multi-label discrete patient records.
For this first step, we assume that you have already downloaded the MIMIC-III dataset. Otherwise, tutorials such as "Getting access to MIMIC III hospital database for data science projects" provide helpful advice.
We need to open Anaconda Navigator, go to Environments, click on the right-pointing triangle next to base (root) and select Open Terminal: this opens a command prompt with the following path: `(base) C:\Users\<username>`.
In the command prompt, we change the directory to the folder where `ADMISSIONS.csv` and `DIAGNOSES_ICD.csv` from the MIMIC-III dataset (we only need these two files) and the Python scripts are saved:
cd C:\Users\<username>\Documents\medgan-master
Still in the command prompt, we can then process the MIMIC-III dataset:
python process_mimic.py ADMISSIONS.csv DIAGNOSES_ICD.csv training-data "binary"
This will create 3 files in our folder: `training-data.matrix`, `training-data.pids` and `training-data.types`.
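Before training, it can be worth checking that the matrix looks as expected. The sketch below is not part of the medGAN repository: the helper names `summarize_matrix` and `load_pickled` are hypothetical, and it assumes that `process_mimic.py` writes its outputs with `pickle` and that the matrix has one row per patient and one column per diagnosis code.

```python
import pickle
from pathlib import Path

import numpy as np


def summarize_matrix(matrix):
    """Return basic facts about a patient-by-code matrix."""
    matrix = np.asarray(matrix)
    return {
        "n_patients": matrix.shape[0],
        "n_codes": matrix.shape[1],
        # True when every entry is 0 or 1, as expected with "binary"
        "is_binary": bool(np.isin(matrix, (0, 1)).all()),
    }


def load_pickled(path):
    # Assumption: the file was written with pickle.dump
    with open(path, "rb") as f:
        return pickle.load(f)


# Only runs when the file produced by process_mimic.py is present
if Path("training-data.matrix").exists():
    print(summarize_matrix(load_pickled("training-data.matrix")))
```

If `is_binary` comes back `False`, the most likely explanation is that the dataset was processed with count variables rather than `"binary"`.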
With the command `python medgan.py --help`, we can see all the parameters we can choose.
Still in the command prompt, we create a `generated` folder in our `medgan-master` folder:
mkdir generated
Then we use the command:
python medgan.py training-data.matrix ./generated/samples --data_type="binary" --n_epoch=10 --n_pretrain_epoch=10
Once again, the goal here is just to run medGAN, not to obtain useful results: we minimize the computing time by taking small values for `n_epoch` and `n_pretrain_epoch`.
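To try several epoch settings without retyping the command, the same call can be assembled from Python. This is only a convenience sketch: `build_train_command` is a hypothetical helper, and it uses nothing beyond the flags shown in the command above.

```python
import sys


def build_train_command(n_epoch=10, n_pretrain_epoch=10,
                        data_file="training-data.matrix",
                        out_prefix="./generated/samples"):
    """Return the medgan.py training command as an argument list."""
    return [
        sys.executable, "medgan.py", data_file, out_prefix,
        "--data_type=binary",
        f"--n_epoch={n_epoch}",
        f"--n_pretrain_epoch={n_pretrain_epoch}",
    ]


cmd = build_train_command()
print(" ".join(cmd))
# To launch training from Python instead of the command prompt:
#   import subprocess; subprocess.run(cmd, check=True)
```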
This will create 32 files in our `generated` folder: `checkpoint`, `samples`, `samples-0.data-00000-of-00001`, `samples-0.index`, `samples-0.meta`, `samples-1.data-00000-of-00001`, etc. In `samples.txt`, we can check the values of `d_loss`, `g_loss`, `accuracy` and `AUC` at each epoch.
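The count of 32 follows from the listing: one `checkpoint` file, one log file, and three files per epoch for 10 epochs. The sketch below reproduces that arithmetic; `expected_files` is a hypothetical helper, the log file name is taken from the listing above, and the pattern is inferred from this run only.

```python
def expected_files(prefix="samples", n_epoch=10, log_name="samples"):
    """List the files expected in the output folder after training."""
    files = ["checkpoint", log_name]
    for epoch in range(n_epoch):  # epochs are indexed from 0
        stem = f"{prefix}-{epoch}"
        files += [
            f"{stem}.data-00000-of-00001",
            f"{stem}.index",
            f"{stem}.meta",
        ]
    return files


files = expected_files()
print(len(files))  # 2 + 3 * 10 = 32
print(files[-1])   # samples-9.meta, from the last epoch
```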
For comparison, the default value of `n_epoch` is 1000 and the default value of `n_pretrain_epoch` is 100.
Note that `training-data.matrix` must contain no missing values.
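A quick way to verify this is to count the NaN entries. The sketch below uses a hypothetical helper, `count_missing`, and assumes the matrix file can be read with `pickle` into a numeric NumPy array.

```python
import pickle
from pathlib import Path

import numpy as np


def count_missing(matrix):
    """Count NaN entries in a numeric matrix."""
    return int(np.isnan(np.asarray(matrix, dtype=float)).sum())


# Small demonstration on a toy array with one missing value
toy = np.array([[0.0, 1.0], [np.nan, 1.0]])
print(count_missing(toy))  # 1

# Only runs when the training file is present (assumed pickle format)
if Path("training-data.matrix").exists():
    with open("training-data.matrix", "rb") as f:
        print(count_missing(pickle.load(f)))
```

A result of 0 on the real file means the matrix is safe to pass to `medgan.py`.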
We use the command:
python medgan.py training-data.matrix gen-samples --model_file=./generated/samples-9 --generate_data=True --data_type="binary"
Replace 9 with the index of the last epoch: here we trained for 10 epochs indexed from 0, so the last checkpoint is `samples-9`. This will create the `gen-samples.npy` file in the `medgan-master` folder.
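The generated file can then be opened with NumPy. The sketch below assumes that the generated values are continuous between 0 and 1, so it rounds them to obtain binary records; the `binarize` helper and the 0.5 threshold are illustrative choices, not part of medGAN.

```python
from pathlib import Path

import numpy as np


def binarize(samples, threshold=0.5):
    """Map continuous values to 0/1 using a fixed threshold."""
    return (np.asarray(samples) >= threshold).astype(np.int8)


# Only runs when the file produced by medgan.py is present
if Path("gen-samples.npy").exists():
    samples = np.load("gen-samples.npy")
    records = binarize(samples)
    print(samples.shape)
    print(records.mean())  # share of entries equal to 1
```

With only 10 training epochs, the generated records are not expected to resemble the real ones; this check only confirms that the pipeline ran end to end.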
Please check the table of contents of my `medgan-tips` repository in my `README.md` file.