Background Currently quantitative RNA-Seq methods are pushed to work with increasingly small starting amounts of RNA that require PCR amplification to generate libraries. However, it is unclear how much noise or bias amplification introduces and how this effects precision and accuracy of RNA quantification. To assess the effects of amplification, reads that originated from the same RNA molecule (PCR-duplicates) need to be identified. Computationally, read duplicates are defined via their mapping position, which does not distinguish PCR- from natural duplicates that are bound to occur for highly transcribed RNAs. Hence, it is unclear how to treat duplicate reads and how important it is to reduce PCR amplification experimentally. Here, we generate and analyse RNA-Seq datasets that were prepared with three different protocols (Smart-Seq, TruSeq and UMI-seq). We find that a large fraction of computationally identified read duplicates can be explained by sampling and fragmentation bias. Consequently, the computational removal of duplicates does not improve accuracy, power or false discovery rates, but can actually worsen them. Even when duplicates are experimentally identified by unique molecular identifiers (UMIs), power and false discovery rate are only mildly improved. However, we do find that power does improve with fewer PCR amplification cycles across datasets and that early barcoding of samples and hence PCR amplification in one reaction can restore this loss of power. Conclusions Computational removal of read duplicates is not recommended for differential expression analysis. However, the pooling of samples as made possible by the early barcoding of the UMI-protocol leads to an appreciable increase in the power to detect differentially expressed genes.
- Downloaded 2,779 times
- Download rankings, all-time:
- Site-wide: 1,618 out of 76,880
- In genomics: 348 out of 5,050
- Year to date:
- Site-wide: 31,556 out of 76,880
- Since beginning of last month:
- Site-wide: 25,981 out of 76,880
Downloads over time
Distribution of downloads per paper, site-wide
- 18 Dec 2019: We're pleased to announce PanLingua, a new tool that enables you to search for machine-translated bioRxiv preprints using more than 100 different languages.
- 21 May 2019: PLOS Biology has published a community page about Rxivist.org and its design.
- 10 May 2019: The paper analyzing the Rxivist dataset has been published at eLife.
- 1 Mar 2019: We now have summary statistics about bioRxiv downloads and submissions.
- 8 Feb 2019: Data from Altmetric is now available on the Rxivist details page for every preprint. Look for the "donut" under the download metrics.
- 30 Jan 2019: preLights has featured the Rxivist preprint and written about our findings.
- 22 Jan 2019: Nature just published an article about Rxivist and our data.
- 13 Jan 2019: The Rxivist preprint is live!