Abstract
Recent empirical studies show three phenomena as language models grow in size: compute-optimal size scaling, emergent capabilities, and performance plateauing. We present a simple unified mathematical framework to explain all three, building on recent skill-text bipartite graph frameworks for semantic learning. Modeling the learning of concepts from texts as an iterative process yields an analogy to iterative decoding of low-density parity check (LDPC) codes in information theory. Then, drawing on finite-size scaling characterizations of LDPC decoding, we derive the compute-optimal size scaling (Chinchilla rule) for language models. Further, using tools from random network theory, we provide a simple explanation for both the emergence of complex skills and the plateauing of performance as the size of language models scales. We also observe multiple plateaus.
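For readers unfamiliar with the Chinchilla rule named above, the following is a minimal numerical sketch of what compute-optimal size scaling means in practice. It is not taken from the paper. It assumes the commonly cited approximations that training compute is about 6 FLOPs per parameter per token and that the compute-optimal token count is about 20 tokens per parameter. The function name and both default constants are illustrative assumptions.

```python
import math


def compute_optimal_allocation(flops, tokens_per_param=20.0, flops_per_param_token=6.0):
    """Split a training compute budget between model size and data size.

    Assumptions (rules of thumb, not results from the paper):
      * compute C is roughly flops_per_param_token * N * D
      * the compute-optimal ratio D / N is roughly tokens_per_param
    Under these assumptions both N and D grow as the square root of C.
    """
    # Solve C = k * N * (r * N) for N, with k = flops_per_param_token
    # and r = tokens_per_param.
    n_params = math.sqrt(flops / (flops_per_param_token * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens


if __name__ == "__main__":
    for budget in (1e21, 4e21, 1.6e22):
        n, d = compute_optimal_allocation(budget)
        print(f"C = {budget:.1e} FLOPs -> N = {n:.2e} params, D = {d:.2e} tokens")
```

Under these assumptions, quadrupling the compute budget doubles both the parameter count and the token count, which is the equal-scaling behavior the Chinchilla rule describes.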
| Original language | English |
|---|---|
| Pages (from-to) | 1338-1348 |
| Number of pages | 11 |
| Journal | IEEE Journal on Selected Topics in Signal Processing |
| Volume | 19 |
| Issue number | 7 |
| DOIs | |
| State | Published - 2025 |
Keywords
- Language models
- composition of skills
- emergence
- low-density parity check (LDPC) codes
- plateauing
- scaling law
- sequential learning
Title
An Information Theory of Compute-Optimal Size Scaling, Emergence, and Plateaus in Language Models