
Polygenic scores for UK Biobank scale data

By Timothy Shin Heng Mak, Robert M. Porsch, Shing Wan Choi, Pak C Sham

Posted 23 Jan 2018
bioRxiv DOI: 10.1101/252270

Polygenic scores (PGS) are estimated scores representing the genetic tendency of an individual for a disease or trait and have become an indispensable tool in a variety of analyses. Typically they are linear combinations of the genotypes of a large number of SNPs, with the weights calculated from an external source, such as summary statistics from large meta-analyses. Recently, cohorts with genetic data have become very large, such that it would be a waste if the raw data were not used in constructing PGS. Making use of raw data in calculating PGS, however, presents us with problems of overfitting. Here we discuss the essence of overfitting as applied to PGS calculations and highlight the difference between overfitting due to the overlap between the target and the discovery data (OTD), and overfitting due to the overlap between the target and the validation data (OTV). We propose two methods -- cross prediction and split validation -- to overcome OTD and OTV respectively. Using these two methods, PGS can be calculated using raw data without overfitting. We show that PGS thus calculated have better predictive power than those using summary statistics alone for six phenotypes in the UK Biobank data.
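To make the abstract's two central ideas concrete, here is a minimal sketch of a PGS as a weighted sum of genotypes, and of how keeping target and discovery individuals apart might look in a cross-prediction style scheme. It is an illustration only, not the authors' implementation: the function names, the per-SNP marginal regression used to estimate weights, and the five-fold split are all assumptions introduced for this example.

```python
import numpy as np


def polygenic_score(genotypes, weights):
    """PGS as a linear combination of SNP genotypes (one score per individual)."""
    return genotypes @ weights


def marginal_weights(genotypes, phenotype):
    """Hypothetical weight estimator: per-SNP simple regression slopes.

    This stands in for whatever discovery analysis produces the weights;
    the abstract does not specify the estimator.
    """
    g = genotypes - genotypes.mean(axis=0)
    y = phenotype - phenotype.mean()
    denom = (g ** 2).sum(axis=0)
    denom[denom == 0] = np.inf  # monomorphic SNPs get a weight of 0
    return (g.T @ y) / denom


def cross_predicted_scores(genotypes, phenotype, n_folds=5, seed=0):
    """Score each fold with weights estimated only from the other folds.

    No individual contributes to the weights used for their own score,
    which is the kind of target/discovery overlap (OTD) the abstract
    describes as a source of overfitting.
    """
    rng = np.random.default_rng(seed)
    n = genotypes.shape[0]
    fold = rng.permutation(n) % n_folds  # assumed random fold assignment
    scores = np.empty(n)
    for k in range(n_folds):
        target = fold == k
        w = marginal_weights(genotypes[~target], phenotype[~target])
        scores[target] = polygenic_score(genotypes[target], w)
    return scores
```

With a genotype matrix coded as 0/1/2 allele counts (rows are individuals, columns are SNPs) and a phenotype vector, `cross_predicted_scores(G, y)` returns one out-of-fold score per individual. Split validation, the second method named in the abstract, is not sketched here because the abstract gives no detail on how it is carried out.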

Download data

  • Downloaded 2,618 times
  • Download rankings, all-time:
    • Site-wide: 4,999
    • In genomics: 563
  • Year to date:
    • Site-wide: 17,121
  • Since beginning of last month:
    • Site-wide: 19,845
