Developing Novel Deep-Learning Based Methods for Deciphering Non-Coding Gene Regulatory Code

Project: Research

Project Details

Description

SUMMARY This project will contribute novel deep-learning tools for the identification and prioritization of combinations of somatic and germline variants that disrupt the gene-regulatory code and are associated with brain and lung cancers. While the effects of genetic mutations within protein-coding regions are well studied, the same is not true for mutations that overlap non-coding genomic regions. From a language-modeling perspective, non-coding DNA is highly complex because of polysemy and distant semantic relationships. To address this challenge, the Davuluri and Liu groups developed genome foundation models that treat DNA as a language, first on the human genome (Ji, Y. et al. 2021) and more recently on multi-species genomes (Zhou, Z. et al. 2023). These foundation models, named DNABERT (code released in 2021) and DNABERT-2 (code released in 2023), achieved state-of-the-art performance on various sequence prediction tasks after straightforward fine-tuning on small task-specific labeled datasets. They are based on Bidirectional Encoder Representations from Transformers (BERT), which led natural language processing (NLP) research into a new era by introducing the paradigm of pre-training and fine-tuning. DNABERT has been widely used (over 2 million downloads on Hugging Face, 588 GitHub stars, 612 citations) in understanding and decoding genomic and epigenomic languages (for example, Rozowsky et al. Cell, 2023). DNABERT is also integrated in NVIDIA BioNeMo, a leading drug discovery platform, as the first genomics foundation model. DNABERT-based fine-tuned models can predict allele-specific activity from local nucleotide sequence context alone, and can prioritize, at genome scale, candidate transcription-factor-binding sites, core promoters, and splice sites that are sensitive to variants.
Based on the success of the DNABERT foundation models, it is hypothesized that pre-trained transformer-based neural network models offer a promising, and not yet fully explored, deep-learning approach for identifying combinations of genetic variants that disrupt the regulatory landscape in the non-coding regions of personal genomes, when applied to whole-genome sequencing (WGS) data. This renewal project draws on the shared genome foundation model and multi-omic data analysis expertise of pioneers in bioinformatics and machine learning, and on a rich set of established collaborators at Stony Brook University (SBU) and Northwestern University (NU). The specific aims are: (1) develop DNABERT fine-tuned models that specifically target non-coding DNA sequence analyses and predictions; (2) develop modules to predict and validate functional non-coding genetic variants by applying the DNABERT prediction models; and (3) develop a novel version of the DNABERT pre-trained model (DNABERT-s) that specializes in creating species-aware DNA embeddings with genome foundation models. The proposed algorithms will accelerate the analysis of non-coding variants from WGS data, identify gene regulatory regions and mutation profiles that can be used to distinguish between high- and low-risk cancers, and support the analysis of metagenomic data.
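The description mentions predicting allele-specific activity from local sequence context and prioritizing variant-sensitive sites. As a rough illustration only, that idea can be sketched as scoring the reference and alternate alleles within the same sequence window and taking the difference. Everything named here is an assumption made for the sketch: `apply_snv`, `window`, `variant_effect`, and the `flank` parameter are hypothetical helpers, and `toy_gc_score` is a stand-in for a fine-tuned model's prediction, not the project's method or any DNABERT API.

```python
from typing import Callable


def apply_snv(sequence: str, position: int, ref: str, alt: str) -> str:
    """Return `sequence` with a single-nucleotide variant applied (0-based position)."""
    if sequence[position] != ref:
        raise ValueError(
            f"expected {ref!r} at position {position}, found {sequence[position]!r}"
        )
    return sequence[:position] + alt + sequence[position + 1:]


def window(sequence: str, center: int, flank: int) -> str:
    """Extract up to `flank` bases on each side of `center`, clipped to the sequence."""
    start = max(0, center - flank)
    end = min(len(sequence), center + flank + 1)
    return sequence[start:end]


def variant_effect(
    sequence: str,
    position: int,
    ref: str,
    alt: str,
    score_fn: Callable[[str], float],
    flank: int = 50,
) -> float:
    """Score difference (alt minus ref) over the local window around the variant."""
    ref_window = window(sequence, position, flank)
    alt_window = window(apply_snv(sequence, position, ref, alt), position, flank)
    return score_fn(alt_window) - score_fn(ref_window)


def toy_gc_score(seq: str) -> float:
    """Placeholder scorer (GC fraction); a real pipeline would call a trained model."""
    return sum(base in "GC" for base in seq) / len(seq)


# Hypothetical example: an A>G substitution at index 4 of a short region.
region = "ATATATATGCGCATATATAT"
delta = variant_effect(region, 4, "A", "G", toy_gc_score, flank=5)
```

Because the scorer is passed in as a function, the same loop could rank many candidate variants by the absolute value of `delta` once a real model's prediction replaces the placeholder.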
Status: Active
Effective start/end date: 08/5/25 to 06/30/26

Funding

  • National Library of Medicine: $1,272,830.00
