Project Details
Description
SUMMARY
This project will contribute novel deep-learning tools for identifying and prioritizing combinations of somatic
and germline variants that disrupt the gene-regulatory code and are associated with brain and lung cancers.
While the effects of genetic mutations within protein-coding regions are well studied, the same is not true for
mutations that fall in non-coding genomic regions. From a language-modeling perspective, non-coding DNA is
highly complex because of polysemy and distant semantic relationships. To address
this challenge, the Davuluri and Liu groups developed genome foundation models that model DNA as a
language, first on the human genome (Ji, Y. et al. 2021) and more recently on multi-species genomes (Zhou, Z.
et al. 2023). These foundation models, DNABERT (code released in 2021) and DNABERT-2 (code
released in 2023), achieved state-of-the-art performance on various sequence prediction tasks after
straightforward fine-tuning on small task-specific labeled datasets. These genome foundation models are based on
Bidirectional Encoder Representations from Transformers (BERT), which led natural language processing
(NLP) research into a new era by introducing the paradigm of pre-training and fine-tuning. DNABERT has been
widely used (over 2 million downloads on Hugging Face, 588 GitHub stars, 612 citations) for understanding and
decoding genomic and epigenomic languages (for example, Rozowsky et al. Cell, 2023). DNABERT is also
integrated into NVIDIA BioNeMo, a leading drug discovery platform, as the first genomics foundation model.
DNABERT-based fine-tuned models can predict allele-specific activity from local nucleotide sequence
context alone, and can prioritize candidate transcription-factor-binding sites, core promoters and splice sites that are
sensitive to variants at genome scale. Based on the success of the DNABERT foundation models, it is hypothesized
that pre-trained transformer-based neural network models offer a promising, and not yet fully explored, deep
learning approach for identifying combinations of genetic variants that disrupt the regulatory landscape in
the non-coding regions of personal genomes when applied to whole-genome sequencing (WGS) data. This renewal
project draws on the shared genome foundation model and multi-omic data analysis expertise of pioneers in
the areas of bioinformatics and machine learning, and on a rich set of established collaborators at Stony
Brook University (SBU) and Northwestern University (NU). The specific aims are: (1) Develop DNABERT fine-tuned
models that specifically target non-coding DNA sequence analyses and predictions; (2) Develop modules to predict
and validate functional non-coding genetic variants by applying the DNABERT prediction models; and (3)
Develop a novel version of the DNABERT pre-trained model (DNABERT-s) that specializes in creating
species-aware DNA embeddings with genome foundation models. The proposed algorithms will accelerate the analysis
of non-coding variants from WGS data and of metagenomic data, and will identify gene regulatory regions and
mutation profiles that can be used to distinguish between high- and low-risk cancers.
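The idea of predicting allele-specific activity from local sequence context and then prioritizing variants can be illustrated with a minimal sketch. Everything here is an assumption for illustration: `gc_fraction` is a toy, hypothetical stand-in for a fine-tuned model's prediction, and the function names, window size, and example sequence are not taken from the project's software.

```python
from typing import Callable, List, Tuple


def gc_fraction(seq: str) -> float:
    """Toy stand-in for a fine-tuned model's score (hypothetical): G/C fraction."""
    return sum(base in "GC" for base in seq) / len(seq)


def local_window(seq: str, pos: int, flank: int) -> Tuple[str, int]:
    """Return the window around `pos` and the variant's offset inside that window."""
    start = max(0, pos - flank)
    end = min(len(seq), pos + flank + 1)
    return seq[start:end], pos - start


def variant_effect(
    seq: str,
    pos: int,
    alt: str,
    score: Callable[[str], float] = gc_fraction,
    flank: int = 5,
) -> float:
    """Score difference (alternate minus reference) for a single-nucleotide variant."""
    ref_window, offset = local_window(seq, pos, flank)
    alt_window = ref_window[:offset] + alt + ref_window[offset + 1:]
    return score(alt_window) - score(ref_window)


def prioritize(
    seq: str,
    variants: List[Tuple[int, str]],
    score: Callable[[str], float] = gc_fraction,
    flank: int = 5,
) -> List[Tuple[int, str, float]]:
    """Rank variants by the absolute size of their predicted effect, largest first."""
    effects = [(pos, alt, variant_effect(seq, pos, alt, score, flank))
               for pos, alt in variants]
    return sorted(effects, key=lambda item: abs(item[2]), reverse=True)


if __name__ == "__main__":
    reference = "ATATATATATATATAT"
    for pos, alt, delta in prioritize(reference, [(5, "A"), (5, "G"), (8, "C")]):
        print(pos, alt, round(delta, 3))
```

In a real setting the `score` argument would be replaced by a call to a fine-tuned sequence model; the ranking logic would stay the same.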
| Status | Active |
|---|---|
| Effective start/end date | 08/5/25 → 06/30/26 |
Funding
- National Library of Medicine: $1,272,830.00