
High performance merging of massive data from genome-wide association studies

  • Emory University

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; peer-review

Abstract

Traditional data processing methods that run on a single computer scale poorly and are inefficient at ordered full-outer joins when merging a large number of individual Genome-Wide Association Studies (GWAS) data files. Although the emergence of big data platforms such as Hadoop and Spark sheds light on this problem, the inefficiency of keeping data in total-sorted order and the workload imbalance problem limit their performance. In this study, we designed and compared three new methodologies, based on MapReduce, HBase, and Spark respectively, to merge hundreds of individual VCF files by Single Nucleotide Polymorphism (SNP) location into a single TPED file. Our methodologies overcame the limitations stated above and considerably improved performance, with good scalability in both input size and computing resources.
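To make the core operation concrete, the following is a minimal single-machine sketch of an ordered full-outer merge. It is not the MapReduce, HBase, or Spark method from the paper. It assumes each individual's records are already parsed from VCF into tuples sorted by (chromosome, position), with chromosomes encoded as integers; the function name `merge_samples` and the record layout are hypothetical. A missing genotype is written as `0 0`, following the PLINK TPED convention, and the genetic distance column is set to `0`.

```python
import heapq
from itertools import groupby

MISSING = ("0", "0")  # PLINK convention for a missing genotype in TPED


def merge_samples(samples):
    """Ordered full-outer merge of per-individual SNP records.

    samples: list of record lists, one per individual, each sorted by
             (chrom, pos). A record is (chrom, pos, snp_id, allele1, allele2).
             This layout is an assumption made for illustration.
    Yields one TPED-style line per distinct (chrom, pos), in sorted order.
    """
    n = len(samples)
    # Tag every record with its sample index so we know which column it fills.
    tagged = [
        [((chrom, pos), idx, snp_id, (a1, a2))
         for chrom, pos, snp_id, a1, a2 in recs]
        for idx, recs in enumerate(samples)
    ]
    # k-way merge of already-sorted streams keeps the total order.
    merged = heapq.merge(*tagged, key=lambda r: r[0])
    for (chrom, pos), group in groupby(merged, key=lambda r: r[0]):
        genotypes = [MISSING] * n  # full outer join: absent samples stay missing
        snp_id = "."
        for _, idx, sid, alleles in group:
            genotypes[idx] = alleles
            if sid != ".":
                snp_id = sid
        fields = [str(chrom), snp_id, "0", str(pos)]
        for a1, a2 in genotypes:
            fields += [a1, a2]
        yield " ".join(fields)


if __name__ == "__main__":
    s1 = [(1, 100, "rs1", "A", "G"), (1, 300, "rs3", "C", "C")]
    s2 = [(1, 200, "rs2", "T", "T"), (1, 300, "rs3", "C", "T")]
    for line in merge_samples([s1, s2]):
        print(line)
```

Every SNP location seen in any input appears exactly once in the output, which is what makes the join "full outer"; the sorted k-way merge is what keeps it "ordered". On a single machine this approach is bounded by one process reading all inputs, which is the scalability limit the abstract describes.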

Original language: English
Title of host publication: Data Management and Analytics for Medicine and Healthcare - 3rd International Workshop, DMAH 2017 Held at VLDB 2017, Proceedings
Editors: Edmon Begoli, Gang Luo, Fusheng Wang
Publisher: Springer Verlag
Pages: 36-40
Number of pages: 5
ISBN (Print): 9783319671857
DOIs
State: Published - 2017
Event: 3rd International Workshop on Data Management and Analytics for Medicine and Healthcare, DMAH 2017 held in conjunction with the 43rd International Conference on Very Large Data Bases, VLDB 2017 - Munich, Germany
Duration: Sep 1 2017 to Sep 1 2017

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 10494 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 3rd International Workshop on Data Management and Analytics for Medicine and Healthcare, DMAH 2017 held in conjunction with the 43rd International Conference on Very Large Data Bases, VLDB 2017
Country/Territory: Germany
City: Munich
Period: 09/1/17 to 09/1/17

Keywords

  • Genome-Wide Association Studies (GWAS)
  • HBase
  • MapReduce
  • Scalability
  • Spark
  • Total order full-outer-merging
  • TPED
  • Variant Call Format (VCF)
