Numeracy enhances the Literacy of Language Models

This repository holds the code and data (Wiki-Convert) for our EMNLP 2021 short paper. We show that magnitude-aware number encoders help language models predict words better, and the results transfer to non-numeric contexts as well. Here are some links to better understand our work:

Anthology | PDF | Slides | Video | Poster | Twitter thread | ACL21 Reviews

Please reach out to me at thawani@usc.edu in case you face any issues or just to chat!

Dataset

Wiki-Convert: A novel dataset of Wikipedia sentences annotated with numbers. The easiest way to get the data is via the Hugging Face Datasets library. Simply install the datasets library and run:

from datasets import load_dataset
ds = load_dataset("usc-isi/WikiConvert")

Example:

id:      0
comment: With a total of 1500 miles of inland waterways, Alabama has among the most of any state.
offset:  16
length:  4
number:  1500

Here, the Wikipedia sentence is provided under the key comment, and the annotated number is identified by its character offset and length, i.e., comment[offset:offset+length] is the number's surface form. You will also find the keys UNIQUE_STORY_INDEX and magnitude, which are irrelevant here and were added only for consistency with the format of the Numeracy600K dataset.
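
As a quick sanity check, the slice can be verified directly on a loaded record; the field names follow the example above, while the split name and index are only illustrative:

ex = ds["train"][0]
span = ex["comment"][ex["offset"] : ex["offset"] + ex["length"]]
assert span == str(ex["number"])  # the slice recovers the annotated number's surface form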

Note that when loading from the Datasets library, numbers larger than sys.maxsize will be capped to avoid an overflow in PyArrow. For the uncapped version, you may download the JSON files directly for the train, dev, and test splits.
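
If you do work with the raw splits, they can be read without the Datasets library. A minimal sketch, assuming each split is a single JSON array of records with the keys shown above and using a placeholder local path:

import json

# Placeholder path: point this at wherever you saved the downloaded train split.
with open("wiki_convert_train.json") as f:
    train = json.load(f)  # Python ints are arbitrary precision, so values above sys.maxsize stay uncapped

print(len(train), train[0]["comment"])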

The dataset sizes are as follows:

                 Train      Dev       Test
# examples       739,583    92,447    92,449
file size (MB)   169        20.9      20.5

If you prefer the NUM and UNIT annotations as described in the paper, here is a 233 MB JSON file. You may also retrieve a larger, unprocessed version of the data at this link.

Code

train.py: model description and training

nice python train.py --batch-size 256 --gpus 0, --tsamples 100_000 --dsamples 10_000 --max_epochs 10 --enc exp --hidden 200 --accumulate_grad_batches 4 --seed 0 --dataset WC

eval.py: reports perplexity and hit@k scores

nice python eval.py --limit 10_000 --ckpt checkpoints/read-WC-def-adj-noun/epoch=9.ckpt --maxtoks 150 --batch-size 128 --device 0

dataset.py: tokenized dataset description

valids/common...txt: list of sentence indices to evaluate

Citation

Here's how to cite us for the results or the Wiki-Convert dataset:

@inproceedings{thawani-etal-2021-numeracy,
    title = "Numeracy enhances the Literacy of Language Models",
    author = "Thawani, Avijit  and
      Pujara, Jay  and
      Ilievski, Filip",
    booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2021",
    address = "Online and Punta Cana, Dominican Republic",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.emnlp-main.557",
    pages = "6960--6967",
    abstract = "Specialized number representations in NLP have shown improvements on numerical reasoning tasks like arithmetic word problems and masked number prediction. But humans also use numeracy to make better sense of world concepts, e.g., you can seat 5 people in your {`}room{'} but not 500. Does a better grasp of numbers improve a model{'}s understanding of other concepts and words? This paper studies the effect of using six different number encoders on the task of masked word prediction (MWP), as a proxy for evaluating literacy. To support this investigation, we develop Wiki-Convert, a 900,000 sentence dataset annotated with numbers and units, to avoid conflating nominal and ordinal number occurrences. We find a significant improvement in MWP for sentences containing numbers, that exponent embeddings are the best number encoders, yielding over 2 points jump in prediction accuracy over a BERT baseline, and that these enhanced literacy skills also generalize to contexts without annotated numbers. We release all code at https://git.io/JuZXn.",
}