This repository contains the code for the master's thesis "Classification of ICD-9 Codes from Unstructured Clinical Notes using Transformer-Based Neural Networks".
- This project was developed with Python 3.8.
- Install the dependencies from the provided requirements.txt. Other package versions might work as well, but installing the versions specified in requirements.txt is recommended.
- The preprocessing is based on that of the CAML model architecture proposed by Mullenbach et al.
- First, edit the local and remote DATA_DIR, MIMIC_3_DIR and PROJECT_DIR in constants_mimic3.py to make them point to your respective data directories.
- Organize the data with the following structure:
mimicdata
| D_ICD_DIAGNOSES.csv
| D_ICD_PROCEDURES.csv
| ICD9_descriptions (already in repo)
|–––mimic3
| | NOTEEVENTS.csv
| | DIAGNOSES_ICD.csv
| | PROCEDURES_ICD.csv
| | *_hadm_ids.csv (already in repo)
Obtain the MIMIC-III files here: https://physionet.org/content/mimiciii/1.4/
- Run dataproc_mimic_III.ipynb. This might take a while.
- Optionally, after running dataproc_mimic_III.ipynb, run data_visualization.ipynb to get plots and statistics on the MIMIC-III and MIMIC-III-50 datasets.
- To train one of the provided models, run sh train_<model_name>.sh in the scripts directory of the respective model directory.
- To test one of the models and reproduce its results, run sh test_<model_name>.sh in the scripts directory of the respective model directory. This loads the best-performing model obtained over k-fold split training for testing.
- To run inference and get predictions for one of the models, run sh inference_<model_name>.sh in the scripts directory of the respective model directory. This loads the best-performing model obtained over k-fold split training for inference. The predictions are stored in the results directory.
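
Before running dataproc_mimic_III.ipynb, it can help to confirm that the data directory matches the layout described above. The following is a minimal sketch, not part of the repository: the function name find_missing_files and the constant REQUIRED_FILES are hypothetical, and the file names are taken from the directory listing in this README.

```python
from pathlib import Path

# Entries named in the layout above, relative to the mimicdata root.
# (Hypothetical helper constant, not defined anywhere in the repository.)
REQUIRED_FILES = [
    "D_ICD_DIAGNOSES.csv",
    "D_ICD_PROCEDURES.csv",
    "ICD9_descriptions",
    "mimic3/NOTEEVENTS.csv",
    "mimic3/DIAGNOSES_ICD.csv",
    "mimic3/PROCEDURES_ICD.csv",
]


def find_missing_files(data_dir):
    """Return the expected entries that are absent under data_dir."""
    root = Path(data_dir)
    missing = [name for name in REQUIRED_FILES if not (root / name).exists()]
    # The layout also lists *_hadm_ids.csv files inside mimic3/.
    if not any((root / "mimic3").glob("*_hadm_ids.csv")):
        missing.append("mimic3/*_hadm_ids.csv")
    return missing
```

Calling find_missing_files with the path to your mimicdata directory returns an empty list when every listed entry is present, and otherwise the relative paths that still need to be added.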