This repository holds code for collecting data from arXiv to build a multi-label text classification dataset, along with a simple baseline classifier trained on it. The dataset is available on Kaggle, and the collection process is shown in this notebook. We use Apache Beam to design the data collection pipeline, which can be run at scale on Dataflow. We hope the dataset will be a useful benchmark for building multi-label text classification systems.
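In a multi-label setup, each arXiv paper carries a list of category tags rather than a single class, so the labels are typically multi-hot encoded before training a classifier. The sketch below (plain Python, with a toy set of papers whose category identifiers are real arXiv tags used purely as illustrative data) shows one way to do that encoding:

```python
# Multi-hot encoding of arXiv category tags for multi-label classification.
# The papers and categories below are illustrative toy data, not the real dataset.

def multi_hot(labels, vocab):
    """Return a 0/1 vector marking which vocabulary labels are present."""
    label_set = set(labels)
    return [1 if term in label_set else 0 for term in vocab]

# Each paper has one or more category tags.
papers = [
    ["cs.LG", "stat.ML"],
    ["cs.CV"],
    ["cs.CV", "cs.LG"],
]

# Sorted vocabulary of every category seen in the dataset.
vocab = sorted({tag for tags in papers for tag in tags})

# One multi-hot vector per paper, aligned with the vocabulary order.
encoded = [multi_hot(tags, vocab) for tags in papers]
```

Libraries such as scikit-learn provide the same transform (`MultiLabelBinarizer`), but the idea is just this fixed-vocabulary binarization.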
Here's an accompanying blog post on keras.io discussing the motivation behind this dataset, building a simple baseline model, etc.: Large-scale multi-label text classification.
We would like to thank Matt Watson for helping us build the simple baseline classifier model, Lukas Schwab (author of arxiv.py) for helping us build our initial data collection utilities, and Robert Bradshaw for his inputs on the Apache Beam pipeline. Thanks also to the ML-GDE program for providing GCP credits that allowed us to run the Beam pipeline at scale on Dataflow.