Efficient genotype compression and analysis of large genetic variation datasets

By Ryan M. Layer, Neil Kindlon, Konrad J. Karczewski, Exome Aggregation Consortium, Aaron R. Quinlan

Posted 20 Apr 2015
bioRxiv DOI: 10.1101/018259 (published DOI: 10.1038/nmeth.3654)

The economy of human genome sequencing has catalyzed ambitious efforts to interrogate the genomes of large cohorts in search of new insight into the genetic basis of disease. This manuscript introduces Genotype Query Tools (GQT) as a new indexing strategy and toolset that addresses an analytical bottleneck by enabling interactive analyses based on genotypes, phenotypes and sample relationships. Speed improvements are achieved by operating directly on a compressed genotype index without decompression. GQT's data compression ratios increase favorably with cohort size, and relative analysis performance improves in kind. We demonstrate substantial performance improvements over state-of-the-art tools using datasets from the 1000 Genomes Project (46-fold), the Exome Aggregation Consortium (443-fold), and simulated datasets of up to 100,000 genomes (218-fold). Furthermore, we show that this indexing strategy facilitates population and statistical genetics measures such as principal component analysis and burden tests. Based on its computational efficiency and by complementing existing toolsets, GQT provides a flexible framework for current and future analyses of massive genome datasets.
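The idea of querying a compressed genotype index without decompressing it can be illustrated with a small sketch. The abstract does not describe GQT's actual index format, so the example below swaps in plain run-length encoding as a stand-in: each sample gets a bitmap over variants (bit set where the sample is heterozygous), and two samples are compared by intersecting their compressed runs directly. The function names (`rle_encode`, `rle_and`, `rle_count`) and the toy genotype data are hypothetical and are not part of GQT.

```python
from itertools import groupby


def rle_encode(bits):
    """Compress a sequence of 0/1 values into (bit, run_length) pairs."""
    return [(bit, sum(1 for _ in run)) for bit, run in groupby(bits)]


def rle_and(a, b):
    """AND two run-length-encoded bitmaps of equal total length.

    Works run by run, so neither input is ever expanded to raw bits.
    """
    out = []
    i = j = 0
    rem_a = a[0][1] if a else 0  # bits left in the current run of a
    rem_b = b[0][1] if b else 0  # bits left in the current run of b
    while i < len(a) and j < len(b):
        step = min(rem_a, rem_b)
        bit = a[i][0] & b[j][0]
        if out and out[-1][0] == bit:
            # Extend the previous output run instead of starting a new one.
            out[-1] = (bit, out[-1][1] + step)
        else:
            out.append((bit, step))
        rem_a -= step
        rem_b -= step
        if rem_a == 0:
            i += 1
            rem_a = a[i][1] if i < len(a) else 0
        if rem_b == 0:
            j += 1
            rem_b = b[j][1] if j < len(b) else 0
    return out


def rle_count(a):
    """Count set bits by summing the lengths of the 1-runs."""
    return sum(length for bit, length in a if bit)


# Toy data (hypothetical): one bit per variant, 1 = sample is heterozygous.
sample_a_het = [0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1]
sample_b_het = [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1]

shared = rle_and(rle_encode(sample_a_het), rle_encode(sample_b_het))
shared_het_count = rle_count(shared)
print(shared_het_count)  # variants where both samples are heterozygous
```

Long runs of identical bits, which are common when most samples carry the reference allele at most sites, collapse into single pairs, so the work done by `rle_and` scales with the number of runs rather than the number of variants. This is one plausible reason a compressed index could become relatively more efficient as datasets grow, though the sketch makes no claim about how GQT achieves the figures reported above.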

Download data

  • Downloaded 3,230 times
  • Download rankings, all-time:
    • Site-wide: 1,618 out of 89,651
    • In genomics: 329 out of 5,704
  • Year to date:
    • Site-wide: 51,746 out of 89,651
  • Since beginning of last month:
    • Site-wide: 51,635 out of 89,651

