
TensorBank: Tensor Lakehouse for Foundation Model Training

  • Romeo Kienzler
  • Johannes Schmude
  • Naomi Simumba
  • Benedikt Blumenstiel
  • Marcus Freitag
  • Daiki Kimura
  • Zoltan Arnold Nagy
  • Michael Behrendt
  • Hendrik Hamann
  • S. Karthik Mukkavilli
  • Daniel Salles Civitarese
  • IBM

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

2 Scopus citations

Abstract

Storing and streaming high-dimensional data for foundation model training became a critical requirement with the rise of foundation models beyond natural language. In this paper we introduce TensorBank, a petabyte-scale tensor lakehouse capable of streaming tensors from Cloud Object Store (COS) to GPU memory at wire speed based on complex relational queries. We use Hierarchical Statistical Indices (HSI) for query acceleration. Our architecture allows tensors to be addressed directly at block level using HTTP range reads. Once in GPU memory, data can be transformed using PyTorch transforms. We provide a generic PyTorch dataset type with a corresponding dataset factory that translates relational queries and requested transformations into a dataset instance. Because the HSI contain statistics on block content at different hierarchical resolution levels, irrelevant blocks can be skipped without reading them. This is an opinionated architecture powered by open standards and making heavy use of open-source technology. Although hardened for production use with geospatial-temporal data, this architecture generalizes to other use cases such as computer vision, computational neuroscience, biological sequence analysis and more.
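
The block-skipping idea described in the abstract can be illustrated with a minimal sketch. It is not the TensorBank implementation. It assumes a two-level index that stores only minimum and maximum values per block (the abstract says only "statistics"), a fixed block size, and blocks laid out contiguously in one object. All names (`BlockStats`, `GroupStats`, `select_blocks`, `range_header`, `BLOCK_BYTES`) are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List

BLOCK_BYTES = 4096  # assumed fixed block size, for illustration only


@dataclass
class BlockStats:
    """Finest index level: value range observed inside one block."""
    block_id: int
    vmin: float
    vmax: float


@dataclass
class GroupStats:
    """Coarser index level: statistics aggregated over several blocks."""
    blocks: List[BlockStats]

    @property
    def vmin(self) -> float:
        return min(b.vmin for b in self.blocks)

    @property
    def vmax(self) -> float:
        return max(b.vmax for b in self.blocks)


def overlaps(vmin: float, vmax: float, lo: float, hi: float) -> bool:
    """True if the interval [vmin, vmax] intersects the query range [lo, hi]."""
    return vmax >= lo and vmin <= hi


def select_blocks(index: List[GroupStats], lo: float, hi: float) -> List[int]:
    """Return ids of blocks that may contain values in [lo, hi]."""
    selected = []
    for group in index:
        if not overlaps(group.vmin, group.vmax, lo, hi):
            continue  # the whole group is skipped without inspecting its blocks
        for block in group.blocks:
            if overlaps(block.vmin, block.vmax, lo, hi):
                selected.append(block.block_id)
    return selected


def range_header(block_id: int, block_bytes: int = BLOCK_BYTES) -> Dict[str, str]:
    """Build the HTTP Range header that addresses one block of an object."""
    start = block_id * block_bytes
    end = start + block_bytes - 1  # HTTP byte ranges are inclusive
    return {"Range": f"bytes={start}-{end}"}


# Toy index: two groups of two blocks each.
index = [
    GroupStats([BlockStats(0, 0.0, 1.0), BlockStats(1, 0.5, 2.0)]),
    GroupStats([BlockStats(2, 10.0, 12.0), BlockStats(3, 11.0, 15.0)]),
]

hits = select_blocks(index, lo=12.5, hi=20.0)
headers = [range_header(b) for b in hits]
print(hits, headers)
```

In this toy index the first group is rejected at the coarse level, so its two blocks are never inspected, and only the byte range of the single matching block would be requested from the object store. The headers could be passed to any HTTP client that supports range requests.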

Original language: English
Title of host publication: Proceedings - 2023 IEEE International Conference on Big Data, BigData 2023
Editors: Jingrui He, Themis Palpanas, Xiaohua Hu, Alfredo Cuzzocrea, Dejing Dou, Dominik Slezak, Wei Wang, Aleksandra Gruca, Jerry Chun-Wei Lin, Rakesh Agrawal
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 3350-3354
Number of pages: 5
ISBN (Electronic): 9798350324457
DOIs
State: Published - 2023
Event: 2023 IEEE International Conference on Big Data, BigData 2023 - Sorrento, Italy
Duration: Dec 15 2023 to Dec 18 2023

Publication series

Name: Proceedings - 2023 IEEE International Conference on Big Data, BigData 2023

Conference

Conference: 2023 IEEE International Conference on Big Data, BigData 2023
Country/Territory: Italy
City: Sorrento
Period: 12/15/23 to 12/18/23

Keywords

  • data lakehouse
  • data streaming
  • database indexing
  • foundation models
  • tensor query
  • tensor streaming
