Additional explanations on how to run Edward Choi's medGAN

In this document, I add a few tips that complement the README.md of Edward Choi's medGAN repository. Congratulations to Edward on his excellent work! I would like to thank Antoine Hamon for his help.

The goal of this document is only to get medGAN running, not to obtain useful results: we minimize the computing time at the cost of generating unrealistic samples. My setup: Windows 10.

In this example, we use binary variables (and not count variables).

Edward Choi's medgan repository is composed of two programs that have since been updated for Python 3:

  • process_mimic.py (124 lines) inputs the public MIMIC-III dataset and outputs a suitable training dataset for medgan.py,
  • medgan.py (410 lines) inputs the output of process_mimic.py and outputs the generated (fake) multi-label discrete patient records.

1) Process the MIMIC-III dataset with process_mimic.py.

For this first step, we assume that you have already downloaded the MIMIC-III dataset. Otherwise, some tutorials such as Getting access to MIMIC III hospital database for data science projects provide helpful advice.

We open Anaconda Navigator, go to Environments, click the triangle to the right of base (root) and choose Open Terminal: this opens a command prompt with the following path: (base) C:\Users\<username>.

In the command prompt, we change the directory to the folder that contains ADMISSIONS.csv and DIAGNOSES_ICD.csv from the MIMIC-III dataset (we only need these two files) as well as the Python scripts:

cd C:\Users\<username>\Documents\medgan-master

Still in the command prompt, we can then process the MIMIC-III dataset:

python process_mimic.py ADMISSIONS.csv DIAGNOSES_ICD.csv training-data "binary"

This will create 3 files in our folder: training-data.matrix, training-data.pids and training-data.types.
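To confirm that the three files were written correctly, they can be opened from Python. The sketch below assumes that process_mimic.py saves each output with pickle, and that the matrix has one row per patient and one column per diagnosis code; please check both against your copy of the script. `load_outputs` is a hypothetical helper name and is not part of medGAN.

```python
import pickle
from pathlib import Path


def load_outputs(prefix):
    """Load the .matrix, .pids and .types files that share `prefix`.

    Assumption: each file was written with pickle.dump.
    """
    loaded = {}
    for suffix in ("matrix", "pids", "types"):
        path = Path(f"{prefix}.{suffix}")
        with path.open("rb") as handle:
            loaded[suffix] = pickle.load(handle)
    return loaded


if __name__ == "__main__":
    # Only runs when the files from step 1 are in the current folder.
    if Path("training-data.matrix").exists():
        outputs = load_outputs("training-data")
        matrix = outputs["matrix"]
        # Assumed layout: rows are patients, columns are diagnosis codes.
        print("rows:", len(matrix))
        print("columns:", len(matrix[0]))
        print("entries in .pids:", len(outputs["pids"]))
        print("entries in .types:", len(outputs["types"]))
```

If the matrix was saved as a NumPy array, NumPy must be installed in the environment for the file to unpickle. As with any pickle file, only open files that you created yourself.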

2) Run medgan.py using the training-data.matrix file generated by process_mimic.py.

With the command python medgan.py --help, we can see all the parameters we can choose.

Still with the command prompt, we create a generated folder in our medgan-master folder:

mkdir generated

Then we use the command:

python medgan.py training-data.matrix ./generated/samples --data_type="binary" --n_epoch=10 --n_pretrain_epoch=10

Once again, the goal here is just to run medGAN, not to obtain useful results: we minimize the computing time by taking small values for n_epoch and n_pretrain_epoch. For comparison, the default value of n_epoch is 1000 and the default value of n_pretrain_epoch is 100. This will create 32 files in our generated folder: checkpoint, samples, and three files for each of the 10 epochs (samples-0.data-00000-of-00001, samples-0.index, samples-0.meta, samples-1.data-00000-of-00001, etc.), that is 3 × 10 + 2 = 32. In samples.txt, we can check the values of d_loss, g_loss, accuracy and AUC at each epoch.

Note that training-data.matrix must contain no missing values.
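A quick way to check this before training is sketched below. It counts missing values and, since we use the binary data type in this example, also counts entries that are neither 0 nor 1. `find_matrix_problems` is a hypothetical helper name, and loading the file with `np.load(..., allow_pickle=True)` assumes that training-data.matrix is a pickled NumPy array.

```python
import numpy as np


def find_matrix_problems(matrix):
    """Return a list of human-readable problems found in a training matrix."""
    values = np.asarray(matrix, dtype=float)
    problems = []

    # Missing values show up as NaN once the data is cast to float.
    n_missing = int(np.isnan(values).sum())
    if n_missing:
        problems.append(f"{n_missing} missing value(s)")

    # For the "binary" data type, every remaining entry should be 0 or 1.
    present = values[~np.isnan(values)]
    n_non_binary = int((~np.isin(present, (0.0, 1.0))).sum())
    if n_non_binary:
        problems.append(f"{n_non_binary} value(s) other than 0 or 1")

    return problems


if __name__ == "__main__":
    import os

    if os.path.exists("training-data.matrix"):
        # Assumption: the file is a pickled NumPy array.
        matrix = np.load("training-data.matrix", allow_pickle=True)
        print(find_matrix_problems(matrix) or "no problems found")
```

An empty list means the matrix passed both checks.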

3) Generate synthetic records.

We use the command:

python medgan.py training-data.matrix gen-samples --model_file=./generated/samples-9 --generate_data=True --data_type="binary"

Replace 9 with the index of the last epoch you trained: here we ran 10 epochs and the index starts at 0, so the last checkpoint is samples-9. This will create the gen-samples.npy file in the medgan-master folder.
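If you are unsure which index to pass to --model_file, the folder can be scanned for the highest-numbered checkpoint. This is a minimal sketch that relies only on the file naming shown in step 2 (samples-0.index, samples-1.index, and so on); `last_checkpoint` is a hypothetical helper name, not part of medGAN.

```python
import re
from pathlib import Path


def last_checkpoint(folder, prefix="samples"):
    """Return '<folder>/<prefix>-<N>' for the largest N found, or None."""
    pattern = re.compile(rf"^{re.escape(prefix)}-(\d+)\.index$")
    epochs = []
    for entry in Path(folder).iterdir():
        match = pattern.match(entry.name)
        if match:
            # Compare as integers so that 10 sorts after 9.
            epochs.append(int(match.group(1)))
    if not epochs:
        return None
    return f"{Path(folder).as_posix()}/{prefix}-{max(epochs)}"


if __name__ == "__main__":
    if Path("generated").is_dir():
        print(last_checkpoint("generated"))
```

The returned value is the checkpoint name without its file extension, which is the form used for --model_file in the command above.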

4) How to interpret gen-samples.npy?

Please check the table of contents of my medgan-tips repository in my README.md file.
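As a first look before turning to that repository, the file can be opened with NumPy. The sketch below assumes that, for the binary data type, the generated values are scores between 0 and 1 rather than exact 0/1 entries; please verify this on your own output. Thresholding at 0.5 is one simple choice and is my assumption, not a recommendation from the medGAN authors. `binarize` is a hypothetical helper name.

```python
import numpy as np


def binarize(samples, threshold=0.5):
    """Turn generated scores into 0/1 records (the threshold is an assumption)."""
    return (np.asarray(samples) >= threshold).astype(int)


if __name__ == "__main__":
    import os

    if os.path.exists("gen-samples.npy"):
        generated = np.load("gen-samples.npy")
        print("shape:", generated.shape)
        print("min / max:", generated.min(), generated.max())

        records = binarize(generated)
        # Average number of active codes per generated record.
        print("mean codes per record:", records.sum(axis=1).mean())
```

Printing the minimum and maximum first shows whether the values really lie between 0 and 1 before any threshold is applied.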