RecoverY: K-mer based read classification for Y-chromosome specific sequencing and assembly
Robert S. Harris,
Kateryna D. Makova,
Posted 14 Jun 2017
bioRxiv DOI: 10.1101/148114 (published DOI: 10.1093/bioinformatics/btx771)
Motivation: The haploid mammalian Y chromosome is usually under-represented in genome assemblies because of its high repeat content and the low sequencing depth that follows from its haploid state. One strategy to ameliorate the low coverage of Y sequences is to experimentally enrich Y-specific material before assembly. Since the enrichment process is imperfect, algorithms are needed to identify putative Y-specific reads prior to downstream assembly. A strategy that uses k-mer abundances to identify such reads was used to assemble the gorilla Y (Tomaszkiewicz et al. 2016). However, that strategy required key parameters to be set manually, a time-consuming process that led to sub-optimal assemblies.

Results: We develop a method, RecoverY, that selects Y-specific reads by automatically choosing the abundance level at which a k-mer is deemed to originate from the Y. The algorithm uses prior knowledge about the Y chromosome of a related species or known Y transcript sequences. We evaluate RecoverY on both simulated and real data, for human and gorilla, and investigate its robustness to important parameters. We show that RecoverY leads to a vastly superior assembly compared to alternative strategies that filter the reads or the contigs. Compared to the preliminary strategy used in Tomaszkiewicz et al. (2016), we achieve a 33% improvement in assembly size and a 20% improvement in NG50, demonstrating the power of automatic parameter selection.

Availability: Our tool RecoverY is freely available at https://github.com/makovalab-psu/RecoverY

Contact: firstname.lastname@example.org, email@example.com
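The general idea of abundance-based read selection described in the abstract can be illustrated with a minimal Python sketch. This is not RecoverY's implementation: the function names, the `quantile` rule for picking the threshold from trusted Y k-mers, and the `min_fraction` cutoff are all assumptions made for illustration. It only shows how a set of k-mers believed to come from the Y (for example, from a related species or from known Y transcripts) could be used to pick an abundance threshold, and how reads could then be kept when enough of their k-mers reach that threshold.

```python
from collections import Counter


def kmers(seq, k):
    """Yield every substring of length k in seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def count_kmers(reads, k):
    """Count how often each k-mer occurs across all reads."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts


def choose_threshold(counts, trusted_kmers, quantile=0.05):
    """Pick an abundance threshold from k-mers assumed to be Y-derived.

    Hypothetical rule (an assumption, not the published method): take a
    low quantile of the abundances of trusted k-mers seen in the reads.
    """
    observed = sorted(counts[km] for km in trusted_kmers if counts[km] > 0)
    if not observed:
        raise ValueError("no trusted k-mer was observed in the reads")
    index = int(quantile * (len(observed) - 1))
    return observed[index]


def select_reads(reads, counts, k, threshold, min_fraction=0.5):
    """Keep reads in which at least min_fraction of k-mers are abundant.

    min_fraction is an illustrative parameter, not taken from the paper.
    """
    selected = []
    for read in reads:
        read_kmers = list(kmers(read, k))
        if not read_kmers:
            continue  # read shorter than k
        abundant = sum(1 for km in read_kmers if counts[km] >= threshold)
        if abundant / len(read_kmers) >= min_fraction:
            selected.append(read)
    return selected
```

In an enriched library, k-mers from the Y are expected to be more abundant than contaminating k-mers, which is why a single abundance cutoff can separate the two; a real pipeline would use a dedicated k-mer counter rather than an in-memory `Counter`.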