Critical evaluation of the use of artificial data for machine learning based de novo peptide identification

  • University of Galway

Research output: Contribution to journal › Article › peer-review

4 Scopus citations

Abstract

Proteins are essential components of all living cells, and so the study of their in situ expression, proteomics, has wide-reaching applications. Peptide identification in proteomics typically relies on matching high-resolution tandem mass spectra to a protein database but can also be performed de novo. While artificial spectra have been successfully incorporated into database search pipelines to increase peptide identification rates, little work has been done to investigate the utility of artificial spectra in the context of de novo peptide identification. Here, we perform a critical analysis of the use of artificial data for the training and evaluation of de novo peptide identification algorithms. First, we classify the different fragment ion types present in real spectra and then estimate the number of spurious matches using random peptides. We then categorise the different types of noise present in real spectra. Finally, we transfer this knowledge to artificial data and test the performance of a state-of-the-art de novo peptide identification algorithm trained using artificial spectra with and without relevant noise addition. Noise supplementation increased artificial training data performance from 30% to 77% of real training data peptide recall. While real data performance was not fully replicated, this work provides the first steps towards an artificial spectrum framework for the training and evaluation of de novo peptide identification algorithms. Further enhanced artificial spectra may allow for more in-depth analysis of de novo algorithms and may alleviate the reliance on database searches for training data.

Original language: English
Pages (from-to): 2732-2743
Number of pages: 12
Journal: Computational and Structural Biotechnology Journal
Volume: 21
State: Published - Jan 2023

Keywords

  • Artificial data
  • Noise
  • Peptide sequencing
  • Synthetic data
  • de novo
