Skip to main navigation Skip to search Skip to main content

On the Distribution, Sparsity, and Inference-time Quantization of Attention Values in Transformers

  • Stony Brook University

Research output: Chapter in Book/Report/Conference proceedingConference contributionpeer-review

5 Scopus citations

Abstract

How much information do NLP tasks really need from a transformer's attention mechanism at application-time (inference)? From recent work, we know that there is sparsity in transformers and that the floating-points within its computation can be discretized to fewer values with minimal loss to task accuracies. However, this requires retraining or even creating entirely new models, both of which can be expensive and carbon-emitting. Focused on optimizations that do not require training, we systematically study the full range of typical attention values necessary. This informs the design of an inference-time quantization technique using both pruning and log-scaled mapping which produces only a few (e.g. 23) unique values. Over the tasks of question answering and sentiment analysis, we find nearly 80% of attention values can be pruned to zeros with minimal (< 1.0%) relative loss in accuracy. We use this pruning technique in conjunction with quantizing the attention values to only a 3-bit format, without retraining, resulting in only a 0.8% accuracy reduction on question answering with fine-tuned RoBERTa.

Original languageEnglish
Title of host publicationFindings of the Association for Computational Linguistics
Subtitle of host publicationACL-IJCNLP 2021
EditorsChengqing Zong, Fei Xia, Wenjie Li, Roberto Navigli
PublisherAssociation for Computational Linguistics (ACL)
Pages4147-4157
Number of pages11
ISBN (Electronic)9781954085541
DOIs
StatePublished - 2021
EventFindings of the Association for Computational Linguistics: ACL-IJCNLP 2021 - Virtual, Online
Duration: Aug 1 2021Aug 6 2021

Publication series

NameFindings of the Association for Computational Linguistics: ACL-IJCNLP 2021

Conference

ConferenceFindings of the Association for Computational Linguistics: ACL-IJCNLP 2021
CityVirtual, Online
Period08/1/2108/6/21

Fingerprint

Dive into the research topics of 'On the Distribution, Sparsity, and Inference-time Quantization of Attention Values in Transformers'. Together they form a unique fingerprint.

Cite this