
This repo contains transformer-based encoder-decoder architectures applied to the task of ICD-9 coding on the MIMIC-III-50 dataset. It is the code for the master's thesis "Classification of ICD-9 Codes from Unstructured Clinical Notes using Transformer-Based Neural Networks" by Malte Feucht at the chair of Computer Aided Medical Procedures …




This repository contains the code for the master's thesis "Classification of ICD-9 Codes from Unstructured Clinical Notes using Transformer-Based Neural Networks".


  • This project was developed with Python 3.8.
  • Install dependencies using the provided requirements.txt. Other package versions might work as well, but it is recommended to install the package versions as specified in the requirements.txt.
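To see whether an existing environment deviates from the pinned versions, a small check along the following lines may help. It is only a sketch: it assumes that requirements.txt sits in the current working directory and uses simple `name==version` pins, and the helper names `parse_pins` and `find_mismatches` are made up for this illustration. Lines with extras or other specifiers are not handled.

```python
import sys
from importlib import metadata
from pathlib import Path


def parse_pins(text):
    """Return {package: version} for lines of the form 'name==version'."""
    pins = {}
    for raw in text.splitlines():
        # Drop comments and environment markers, then surrounding whitespace.
        line = raw.split("#", 1)[0].split(";", 1)[0].strip()
        if "==" in line:
            name, version = line.split("==", 1)
            pins[name.strip().lower()] = version.strip()
    return pins


def find_mismatches(pins):
    """Return {package: (pinned, installed or None)} for every deviation."""
    mismatches = {}
    for name, pinned in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != pinned:
            mismatches[name] = (pinned, installed)
    return mismatches


if __name__ == "__main__":
    if sys.version_info[:2] != (3, 8):
        print("Note: the project was developed with Python 3.8.")
    req = Path("requirements.txt")  # assumed location
    if req.exists():
        found = find_mismatches(parse_pins(req.read_text()))
        for name, (pinned, installed) in found.items():
            print(f"{name}: pinned {pinned}, installed {installed}")
```

An empty output means every pinned package is installed in exactly the pinned version.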


  • The preprocessing is based on that of the CAML model architecture proposed by Mullenbach et al.
  • First, edit the local and remote DATA_DIR, MIMIC_3_DIR and PROJECT_DIR settings so that they point to your respective data directories.
  • Organize the data with the following structure:

|      D_ICD_DIAGNOSES.csv
|      ICD9_descriptions (already in repo)
|      |      NOTEEVENTS.csv
|      |      DIAGNOSES_ICD.csv
|      |      PROCEDURES_ICD.csv
|      |      *_hadm_ids.csv (already in repo)
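Before starting the preprocessing, it can be useful to confirm that all files from the listing above are in place. The following is a minimal sketch, not part of the repository: the function `missing_files` and the `data` path are placeholders, and the search is recursive so that it does not depend on the exact nesting of the directories.

```python
from pathlib import Path

# File names taken from the listing above; "*_hadm_ids.csv" is a glob pattern.
EXPECTED_PATTERNS = [
    "D_ICD_DIAGNOSES.csv",
    "ICD9_descriptions",
    "NOTEEVENTS.csv",
    "DIAGNOSES_ICD.csv",
    "PROCEDURES_ICD.csv",
    "*_hadm_ids.csv",
]


def missing_files(data_dir, patterns=EXPECTED_PATTERNS):
    """Return the patterns that match nothing anywhere below data_dir."""
    root = Path(data_dir)
    # rglob searches recursively; next(..., None) stops at the first match.
    return [p for p in patterns if next(root.rglob(p), None) is None]


if __name__ == "__main__":
    # Hypothetical location; use the DATA_DIR you configured for the project.
    data_dir = Path("data")
    for pattern in missing_files(data_dir):
        print(f"missing under {data_dir}: {pattern}")
```

The check only looks at file names, so it will not notice a file that is present but placed in the wrong subdirectory.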

Obtain the MIMIC-III files here:

  • Run dataproc_mimic_III.ipynb. This might take a while.
  • Optionally, after running dataproc_mimic_III.ipynb you can run data_visualization.ipynb to get plots and statistics on the MIMIC-III and MIMIC-III-50 datasets.


  • To train one of the provided models, run sh train_<model_name>.sh in the scripts directory of the respective model directory.


  • To test and reproduce the results for one of the models, run sh test_<model_name>.sh in the scripts directory of the respective model directory. This will load the best performing model obtained over k-fold split training for testing.


  • To run inference and get predictions for one of the models, run sh inference_<model_name>.sh in the scripts directory of the respective model directory. This will load the best performing model obtained over k-fold split training for inference. The predictions are stored in the results directory.
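The train, test and inference scripts follow the same naming pattern, so they can also be launched from Python if that is more convenient. This is a sketch under assumptions: `script_command` and `run_stage` are hypothetical helpers, the model name and directory in the usage note are placeholders, and the script is started from inside the model's scripts directory as described above.

```python
import subprocess
from pathlib import Path

STAGES = ("train", "test", "inference")


def script_command(stage, model_name):
    """Build the argument list for `sh <stage>_<model_name>.sh`."""
    if stage not in STAGES:
        raise ValueError(f"stage must be one of {STAGES}, got {stage!r}")
    return ["sh", f"{stage}_{model_name}.sh"]


def run_stage(stage, model_name, model_dir):
    """Run the stage script from the scripts directory of the model."""
    scripts_dir = Path(model_dir) / "scripts"
    # check=True raises CalledProcessError if the script exits non-zero.
    return subprocess.run(
        script_command(stage, model_name), cwd=scripts_dir, check=True
    )
```

For example, `run_stage("train", "my_model", "path/to/my_model")` would run `sh train_my_model.sh` inside `path/to/my_model/scripts`, where `my_model` stands in for one of the provided models.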

