
DNABERT-S: Pioneering species differentiation with species-aware DNA embeddings

  • Zhihan Zhou
  • Weimin Wu
  • Harrison Ho
  • Jiayi Wang
  • Lizhen Shi
  • Ramana V. Davuluri
  • Zhong Wang
  • Han Liu

  • Northwestern University
  • University of California Merced
  • United States Department of Energy
  • Lawrence Berkeley National Laboratory

Research output: Contribution to journal / Article / peer-review

17 Scopus citations

Abstract

Summary: We introduce DNABERT-S, a tailored genome model that develops species-aware embeddings to naturally cluster and segregate DNA sequences of different species in the embedding space. Differentiating species from genomic sequences (i.e., DNA and RNA) is vital yet challenging, since many real-world species remain uncharacterized and lack known reference genomes. Embedding-based methods are therefore used to differentiate species in an unsupervised manner. DNABERT-S builds upon a pre-trained genome foundation model named DNABERT-2. To encourage effective embeddings for error-prone long-read DNA sequences, we introduce Manifold Instance Mixup (MI-Mix), a contrastive objective that mixes the hidden representations of DNA sequences at randomly selected layers and trains the model to recognize and differentiate these mixed proportions at the output layer. We further enhance it with the proposed Curriculum Contrastive Learning (C2LR) strategy. Empirical results on 28 diverse datasets show DNABERT-S's effectiveness, especially in realistic label-scarce scenarios. For example, it identifies twice as many species from a mixture of unlabeled genomic sequences, doubles the Adjusted Rand Index (ARI) in species clustering, and, with just 2-shot training, outperforms the top baseline's 10-shot species classification performance.
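The Adjusted Rand Index mentioned in the abstract measures how well a clustering agrees with reference labels, corrected for chance agreement. The following is a minimal sketch of the standard ARI formula using only the Python standard library; it is not the authors' evaluation code, and the function name `adjusted_rand_index` and the toy species labels are assumptions made for illustration.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index between two labelings of the same items."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("labelings must have the same length")
    n = len(labels_true)

    # Contingency counts: items sharing a (reference label, cluster) pair.
    cell_counts = Counter(zip(labels_true, labels_pred))
    true_counts = Counter(labels_true)
    pred_counts = Counter(labels_pred)

    # Number of item pairs that fall together in each cell, row, and column.
    sum_cells = sum(comb(c, 2) for c in cell_counts.values())
    sum_true = sum(comb(c, 2) for c in true_counts.values())
    sum_pred = sum(comb(c, 2) for c in pred_counts.values())
    total_pairs = comb(n, 2)

    if total_pairs == 0:
        return 1.0  # fewer than two items: the labelings trivially agree

    expected = sum_true * sum_pred / total_pairs  # agreement expected by chance
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. one cluster in both labelings
    return (sum_cells - expected) / (max_index - expected)


# Hypothetical toy data: six sequences from two species, clustered into three groups.
species = ["A", "A", "A", "B", "B", "B"]
clusters = [0, 0, 1, 1, 2, 2]
score = adjusted_rand_index(species, clusters)
```

A score of 1 indicates that the clustering matches the reference labels up to a renaming of clusters, while values near 0 indicate agreement no better than chance; negative values are possible when agreement is worse than chance.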

Original language: English
Pages (from-to): i255-i264
Journal: Bioinformatics
Volume: 41
State: Published - Jul 1 2025
