This repository holds code for collecting data from arXiv to build a multi-label text classification dataset, along with a simple baseline classifier trained on it. The dataset is available on Kaggle, and the collection process is shown in this notebook. We use Apache Beam to design the data collection pipeline, which can be run at scale on Dataflow. We hope the dataset will be a useful benchmark for building multi-label text classification systems.
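In a multi-label setup, each arXiv paper carries a list of category tags rather than a single class, so the labels are typically multi-hot encoded before training a classifier. The sketch below (plain Python, with a toy set of papers whose category identifiers are real arXiv tags used purely as illustrative data) shows one way to do that encoding:

```python
# Multi-hot encoding of arXiv category tags for multi-label classification.
# The papers and categories below are illustrative toy data, not the real dataset.

def multi_hot(labels, vocab):
    """Return a 0/1 vector marking which vocabulary labels are present."""
    label_set = set(labels)
    return [1 if term in label_set else 0 for term in vocab]

# Each paper has one or more category tags.
papers = [
    ["cs.LG", "stat.ML"],
    ["cs.CV"],
    ["cs.CV", "cs.LG"],
]

# Sorted vocabulary of every category seen in the dataset.
vocab = sorted({tag for tags in papers for tag in tags})

# One multi-hot vector per paper, aligned with the vocabulary order.
encoded = [multi_hot(tags, vocab) for tags in papers]
```

Libraries such as scikit-learn provide the same transform (`MultiLabelBinarizer`), but the idea is just this fixed-vocabulary binarization.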
Here's an accompanying blog post on keras.io discussing the motivation behind this dataset, building a simple baseline model, etc.: Large-scale multi-label text classification.
We would like to thank Matt Watson for helping us build the simple baseline classifier model, Lukas Schwab (author of arxiv.py) for helping us build our initial data collection utilities, and Robert Bradshaw for his inputs on the Apache Beam pipeline. Thanks also to the ML-GDE program for providing GCP credits that allowed us to run the Beam pipeline at scale on Dataflow.