Legacy Data Confounds Genomics Studies

By Luke Anderson-Trocmé, Rick Farouni, Mathieu Bourgey, Yoichiro Kamatani, Koichiro Higasa, Jeong-Sun Seo, Changhoon Kim, Fumihiko Matsuda, Simon Gravel

Posted 03 May 2019
bioRxiv DOI: 10.1101/624908 (published DOI: 10.1093/molbev/msz201)

Recent reports have identified differences in the mutational spectra across human populations. Although some of these reports have been replicated in other cohorts, most have been reported only in the 1000 Genomes Project (1kGP) data. While investigating an intriguing putative population stratification within the Japanese population, we identified a previously unreported batch effect that leads to spurious mutation calls in the 1kGP data and to the apparent population stratification. Because the 1kGP data is used extensively, we find that the batch effects also cause incorrect imputation by leading imputation servers and a small number of suspicious GWAS associations. Lower-quality data from the early phases of the 1kGP thus continues to contaminate modern studies in hidden ways. It may be time to retire or upgrade such legacy sequencing data.

Download data

  • Downloaded 1,520 times
  • Download rankings, all-time:
    • Site-wide: 11,184
    • In bioinformatics: 1,366
  • Year to date:
    • Site-wide: 60,522
  • Since beginning of last month:
    • Site-wide: 75,179
