
Enabling Efficient SpMM for Sparse Attention on GEMM-Optimized Hardware with Block Aggregation

  • Stony Brook University

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Rapidly growing context lengths have amplified the inherent sparsity in the attention mechanism of popular Large Language Models. However, the dynamic data access patterns required by sparse attention are challenging to realize using static data paths, leading to execution inefficiency. Existing SpMM hardware acceleration techniques address these inefficiencies by dynamically configuring data paths to align with the unstructured data access patterns of sparse attention. However, these approaches are not applicable to GEMM-optimized hardware, where dynamic data paths would introduce unacceptable hardware complexity and frequency degradation. We leverage the observation that non-zero values in sparse attention are not random, but exist in clusters due to the nature of the attention mechanism. We develop a technique that uses an index merge/sort module to dynamically aggregate unpruned values into small dense tiles, thus transforming post-attention SpMM into many smaller dense GEMM blocks, allowing efficient acceleration on dense GEMM-optimized hardware. We implemented our design on an Intel Stratix 10 NX FPGA with hardened Tensor Blocks, demonstrating performance improvements while preserving model accuracy. Specifically, we achieve speedups of 3.89×, 2.85×, and 2.66× for the attention-value product in popular chatglm2-6b-32k, llama2-7b-chat-4k, and mixtral-8x7b models, respectively, compared to a dense GEMM baseline on the same hardware. Compared to prior SpMM hardware [31], our design demonstrates improved scalability and up to 2.68× higher throughput.
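The aggregation idea in the abstract can be illustrated with a small software sketch. This is not the authors' hardware design. It assumes, purely for illustration, that the pruned attention matrix is stored as one `{column: weight}` dictionary per query row, that rows are processed in fixed-size tiles, and that the merge/sort step amounts to taking the sorted union of the surviving column indices in a tile. The names `aggregate_tile`, `block_aggregated_spmm`, and `tile_rows` are hypothetical.

```python
# Illustrative sketch only: a software analogue of aggregating clustered
# non-zeros into small dense tiles so that SpMM becomes several small GEMMs.
# Data layout, tile size, and all function names are assumptions.

def dense_matmul(a, b):
    """Plain dense GEMM on lists of lists: (m x k) times (k x n)."""
    inner, n = len(b), len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(n)]
            for row in a]

def aggregate_tile(sparse_rows):
    """Merge and sort the column indices kept by any row in the tile,
    then pack the tile into a dense (rows x merged_columns) block."""
    merged = sorted(set().union(*(row.keys() for row in sparse_rows)))
    dense = [[row.get(c, 0.0) for c in merged] for row in sparse_rows]
    return merged, dense

def block_aggregated_spmm(sparse_attn, values, tile_rows=2):
    """Compute sparse_attn @ values one tile at a time using dense GEMMs."""
    d = len(values[0])
    out = []
    for start in range(0, len(sparse_attn), tile_rows):
        tile = sparse_attn[start:start + tile_rows]
        cols, dense_tile = aggregate_tile(tile)
        if cols:
            gathered_v = [values[c] for c in cols]  # gather matching V rows
            out.extend(dense_matmul(dense_tile, gathered_v))
        else:
            out.extend([[0.0] * d for _ in tile])  # fully pruned tile
    return out

def reference_spmm(sparse_attn, values):
    """Row-by-row sparse product, used only to check the tiled version."""
    d = len(values[0])
    return [[sum(w * values[c][j] for c, w in row.items()) for j in range(d)]
            for row in sparse_attn]

# Toy input: 4 query rows over 6 keys, with non-zeros clustered per tile.
sparse_attn = [
    {0: 0.5, 1: 0.5},
    {1: 0.25, 2: 0.75},
    {4: 1.0},
    {4: 0.5, 5: 0.5},
]
values = [[float(i), float(2 * i)] for i in range(6)]

result = block_aggregated_spmm(sparse_attn, values)
```

Because the non-zeros in each tile fall in nearby columns, the merged index list stays short (three columns for the first tile, two for the second, out of six), so each dense block is much smaller than the full row width. If the non-zeros were scattered, the merged list would approach the full key count and the benefit would shrink.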

Original language: English
Title of host publication: Proceedings of the 2026 ACM/SIGDA International Symposium on Field Programmable Gate Arrays, FPGA 2026
Publisher: Association for Computing Machinery, Inc
Pages: 67-78
Number of pages: 12
ISBN (Electronic): 9798400720796
DOIs
State: Published - Feb 21 2026
Event: 34th ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, FPGA 2026 - Seaside, United States
Duration: Feb 22 2026 to Feb 24 2026

Publication series

Name: Proceedings of the 2026 ACM/SIGDA International Symposium on Field Programmable Gate Arrays, FPGA 2026

Conference

Conference: 34th ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, FPGA 2026
Country/Territory: United States
City: Seaside
Period: 02/22/26 to 02/24/26

Keywords

  • self-attention
  • sparse attention
  • Sparse-dense matrix multiplication
  • Tensor Block
