
An Information Theory of Compute-Optimal Size Scaling, Emergence, and Plateaus in Language Models

  • University of Illinois at Urbana-Champaign

Research output: Contribution to journal › Article › peer-review

1 Scopus citation

Abstract

Recent empirical studies show three phenomena with increasing size of language models: compute-optimal size scaling, emergent capabilities, and performance plateauing. We present a simple unified mathematical framework to explain all of these language model scaling phenomena, building on recent skill-text bipartite graph frameworks for semantic learning. Modeling the learning of concepts from texts as an iterative process yields an analogy to iterative decoding of low-density parity check (LDPC) codes in information theory. Thence, drawing on finite-size scaling characterizations of LDPC decoding, we derive the compute-optimal size scaling (Chinchilla rule) for language models. Further, using tools from random network theory, we provide a simple explanation for both the emergence of complex skills and the plateauing of performance as the size of language models scales. We see multiple plateaus.
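
As a rough illustration of the iterative-decoding analogy mentioned in the abstract, the sketch below runs a standard peeling decoder on a toy bipartite graph. It is not the authors' model: the function name `peel`, the mapping of check nodes to texts and variable nodes to concepts, and the example graph are all assumptions made here for illustration. A check (text) that has exactly one still-unknown neighbor (concept) resolves it, and each newly resolved concept may unlock further checks.

```python
from collections import deque


def peel(checks, known):
    """Peeling decoder on a bipartite graph (hypothetical helper).

    checks: list of sets, each holding the variable ids one check touches.
    known:  set of variable ids that are known at the start.
    Returns the set of variables known once no check can make progress.
    """
    known = set(known)
    # For every check, track the neighbors that are still unknown.
    unknown = [set(c) - known for c in checks]
    # Index: variable id -> list of checks that touch it.
    touching = {}
    for i, c in enumerate(checks):
        for v in c:
            touching.setdefault(v, []).append(i)

    # A check with exactly one unknown neighbor can resolve that neighbor.
    queue = deque(i for i, u in enumerate(unknown) if len(u) == 1)
    while queue:
        i = queue.popleft()
        if len(unknown[i]) != 1:
            continue  # already resolved through another check
        v = next(iter(unknown[i]))
        known.add(v)
        for j in touching[v]:
            unknown[j].discard(v)
            if len(unknown[j]) == 1:
                queue.append(j)
    return known


# Toy graph (assumed): a chain of three checks plus one isolated check.
toy_checks = [{0, 1}, {1, 2}, {2, 3}, {4, 5}]
resolved = peel(toy_checks, known={0})
print(sorted(resolved))
```

In this toy run, knowing variable 0 lets the chain resolve variables 1, 2, and 3 in turn, while the check on {4, 5} never has a single unknown neighbor and stays stuck. That stall is the kind of behavior the decoding analogy is concerned with, although the paper's actual analysis relies on finite-size scaling results rather than a simulation like this one.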

Original language: English
Pages (from-to): 1338-1348
Number of pages: 11
Journal: IEEE Journal on Selected Topics in Signal Processing
Volume: 19
Issue number: 7
State: Published - 2025

Keywords

  • Language models
  • composition of skills
  • emergence
  • low-density parity check (LDPC) codes
  • plateauing
  • scaling law
  • sequential learning
