Additional explanations on how to run Edward Choi's medGAN

In this markdown, I add a few tips that complement the README.md of Edward Choi's medGAN repository. Congratulations to Edward on his excellent work! I would also like to thank Antoine Hamon for his help.

The goal of this markdown is just to run medGAN, not to obtain useful results: we minimize the computing time at the cost of generated samples that are not very realistic. My specs: Windows 10.

In this example, we use binary variables (and not count variables).

Edward Choi's medgan repository is composed of two programs that have since been updated for Python 3:

- `process_mimic.py` (124 lines) takes the public MIMIC-III dataset as input and outputs a suitable training dataset for `medgan.py`,
- `medgan.py` (410 lines) takes the output of `process_mimic.py` as input and outputs the generated (fake) multi-label discrete patient records.

1) Process the MIMIC-III dataset with `process_mimic.py`.

For this first step, we assume that you have already downloaded the MIMIC-III dataset. Otherwise, tutorials such as "Getting access to MIMIC III hospital database for data science projects" provide helpful advice.

We open Anaconda Navigator, go to Environments, click on the triangle next to base (root) and choose Open Terminal. This opens a command prompt at the following path: (base) C:\Users\<username>.

In the command prompt, we change the directory to the folder that contains the Python scripts together with ADMISSIONS.csv and DIAGNOSES_ICD.csv from the MIMIC-III dataset (we only need these two files):

cd C:\Users\<username>\Documents\medgan-master

Still in the command prompt, we can then process the MIMIC-III dataset:

python process_mimic.py ADMISSIONS.csv DIAGNOSES_ICD.csv training-data "binary"

This will create 3 files in our folder: training-data.matrix, training-data.pids and training-data.types.
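To sanity-check these outputs before training, a minimal sketch along the following lines may help. It assumes that `process_mimic.py` writes its files with Python's `pickle` module and that the matrix is a NumPy array with one row per patient and one column per diagnosis code; both points are worth verifying against your copy of the script. The helper names `load_pickle` and `describe_matrix` are mine and are not part of medGAN.

```python
import pickle

import numpy as np


def load_pickle(path):
    """Load a file written by process_mimic.py (assumed to be a pickle)."""
    with open(path, "rb") as f:
        return pickle.load(f)


def describe_matrix(matrix):
    """Return basic facts about a patient-by-code matrix."""
    matrix = np.asarray(matrix)
    return {
        "n_patients": matrix.shape[0],
        "n_codes": matrix.shape[1],
        # With the "binary" option, every entry should be 0 or 1.
        "is_binary": bool(np.isin(matrix, [0, 1]).all()),
    }


# Example usage (run in the medgan-master folder):
# matrix = load_pickle("training-data.matrix")
# print(describe_matrix(matrix))
```

If `is_binary` comes back `False`, the file was probably produced with the count option rather than "binary".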

2) Run `medgan.py` using the `training-data.matrix` file generated by `process_mimic.py`.

The command `python medgan.py --help` lists all the parameters we can set.

Still in the command prompt, we create a `generated` folder inside our medgan-master folder:

mkdir generated

Then we use the command:

python medgan.py training-data.matrix ./generated/samples --data_type="binary" --n_epoch=10 --n_pretrain_epoch=10

Once again, the goal here is just to run medGAN, not to obtain useful results: we keep the computing time low by choosing small values for n_epoch and n_pretrain_epoch. For comparison, the default value of n_epoch is 1000 and the default value of n_pretrain_epoch is 100. The command creates 32 files in our generated folder: checkpoint, samples, samples-0.data-00000-of-00001, samples-0.index, samples-0.meta, samples-1.data-00000-of-00001, etc. In samples.txt, we can check the values of d_loss, g_loss, accuracy and AUC at each epoch.
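The count of 32 files can be reproduced with a short sketch. It assumes, consistently with the file names listed above, that one checkpoint made of three files is saved per epoch (indexed from 0), in addition to the `checkpoint` file and the log file named after the output prefix. The function `expected_output_files` is a hypothetical helper, not part of medGAN.

```python
def expected_output_files(prefix, n_epoch):
    """List the file names expected after training for n_epoch epochs.

    Assumes three checkpoint files per epoch, indexed from 0, plus a
    `checkpoint` file and a log file named after the prefix.
    """
    files = ["checkpoint", prefix]
    for epoch in range(n_epoch):
        for suffix in (".data-00000-of-00001", ".index", ".meta"):
            files.append(f"{prefix}-{epoch}{suffix}")
    return files


files = expected_output_files("samples", 10)
print(len(files))  # 2 + 3 * 10 = 32
```

With the default n_epoch of 1000, the same reasoning would give 3002 files, which is one more reason to keep n_epoch small for a first run.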

Note that training-data.matrix must contain no missing values.
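One way to check this before launching a run is to count the NaN entries. The sketch below is only an illustration: `count_missing` is a name I introduce here, and the commented usage assumes that the matrix file can be read with `pickle`, which should be verified for your version of `process_mimic.py`.

```python
import numpy as np


def count_missing(matrix):
    """Count the NaN entries of a numeric matrix."""
    return int(np.isnan(np.asarray(matrix, dtype=float)).sum())


# Example usage (assumes training-data.matrix is a pickled NumPy array):
# import pickle
# with open("training-data.matrix", "rb") as f:
#     matrix = pickle.load(f)
# print(count_missing(matrix))  # should be 0 before running medgan.py
```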

3) Generate synthetic records.

We use the command:

python medgan.py training-data.matrix gen-samples --model_file=./generated/samples-9 --generate_data=True --data_type="binary"

Replace 9 with the index of the last epoch of your own run: here we trained for 10 epochs, indexed from 0, so the last checkpoint is samples-9. This will create the gen-samples.npy file in the medgan-master folder.
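If you do not remember how many epochs were run, the last checkpoint can be found by scanning the generated folder. This is a minimal sketch that relies only on the `samples-N.index` naming shown earlier; `last_checkpoint` is a hypothetical helper, and the value it returns is meant to be passed to `--model_file`.

```python
import re
from pathlib import Path


def last_checkpoint(folder, prefix="samples"):
    """Return the checkpoint path with the highest epoch index, or None."""
    pattern = re.compile(re.escape(prefix) + r"-(\d+)\.index$")
    epochs = []
    for path in Path(folder).iterdir():
        match = pattern.match(path.name)
        if match:
            # Compare epochs as integers so that 10 ranks above 9.
            epochs.append(int(match.group(1)))
    if not epochs:
        return None
    return f"{folder}/{prefix}-{max(epochs)}"


# Example usage (run in the medgan-master folder):
# print(last_checkpoint("./generated"))
```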

4) How to interpret `gen-samples.npy`?

Please check the table of contents of my medgan-tips repository in my README.md file.
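As a first look at the file, the sketch below loads it with `np.load` and rounds the values to 0 or 1. It assumes that, for binary data, the file holds real-valued scores between 0 and 1, with one row per synthetic patient and one column per code, and that a cut-off of 0.5 is an acceptable way to binarize them. Both are assumptions to check on your own output, and `binarize` and `code_frequencies` are illustrative helpers that do not belong to medGAN.

```python
import numpy as np


def binarize(samples, threshold=0.5):
    """Map scores to 0/1: values at or above the threshold become 1."""
    return (np.asarray(samples) >= threshold).astype(int)


def code_frequencies(binary_samples):
    """Fraction of synthetic patients that have each code (column means)."""
    return np.asarray(binary_samples).mean(axis=0)


# Example usage (run in the medgan-master folder):
# samples = np.load("gen-samples.npy")
# binary = binarize(samples)
# print(binary.shape)
# print(code_frequencies(binary)[:10])
```

Comparing these per-code frequencies with the column means of the training matrix gives a quick idea of how close the synthetic records are to the real ones.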
