TY - GEN
T1 - Enabling Efficient SpMM for Sparse Attention on GEMM-Optimized Hardware with Block Aggregation
AU - Ji, Tianchu
AU - Balasubramanian, Niranjan
AU - Ferdman, Michael
AU - Milder, Peter
N1 - Publisher Copyright: © 2026 Copyright held by the owner/author(s).
PY - 2026/2/21
Y1 - 2026/2/21
N2 - Rapidly growing context lengths have amplified the inherent sparsity in the attention mechanism of popular Large Language Models. However, the dynamic data access patterns required by sparse attention are challenging to realize using static data paths, leading to execution inefficiency. Existing SpMM hardware acceleration techniques address these inefficiencies by dynamically configuring data paths to align with the unstructured data access patterns of sparse attention. However, these approaches are not applicable to GEMM-optimized hardware, where dynamic data paths would introduce unacceptable hardware complexity and frequency degradation. We leverage the observation that non-zero values in sparse attention are not random, but exist in clusters due to the nature of the attention mechanism. We develop a technique that uses an index merge/sort module to dynamically aggregate unpruned values into small dense tiles, thus transforming post-attention SpMM into many smaller dense GEMM blocks, allowing efficient acceleration on dense GEMM-optimized hardware. We implemented our design on an Intel Stratix 10 NX FPGA with hardened Tensor Blocks, demonstrating performance improvements while preserving model accuracy. Specifically, we achieve speedups of 3.89×, 2.85×, and 2.66× for the attention-value product in popular chatglm2-6b-32k, llama2-7b-chat-4k, and mixtral-8x7b models, respectively, compared to a dense GEMM baseline on the same hardware. Compared to prior SpMM hardware [31], our design demonstrates improved scalability and up to 2.68× higher throughput.
KW - self-attention
KW - sparse attention
KW - sparse-dense matrix multiplication
KW - Tensor Block
UR - https://www.scopus.com/pages/publications/105032987932
U2 - 10.1145/3748173.3779187
DO - 10.1145/3748173.3779187
M3 - Conference contribution
AN - SCOPUS:105032987932
T3 - Proceedings of the 2026 ACM/SIGDA International Symposium on Field Programmable Gate Arrays, FPGA 2026
SP - 67
EP - 78
BT - Proceedings of the 2026 ACM/SIGDA International Symposium on Field Programmable Gate Arrays, FPGA 2026
PB - Association for Computing Machinery, Inc
T2 - 34th ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, FPGA 2026
Y2 - 22 February 2026 through 24 February 2026
ER -