Rxivist logo

Rxivist combines preprints from bioRxiv with data from Twitter to help you find the papers being discussed in your field. Currently indexing 62,472 bioRxiv papers from 277,337 authors.

Most downloaded bioRxiv papers, all time

in category bioinformatics

6,136 results found. For more information, click each entry to expand.

41: A deep learning approach to pattern recognition for short DNA sequences
more details view paper

Posted to bioRxiv 22 Jun 2018

A deep learning approach to pattern recognition for short DNA sequences
6,633 downloads bioinformatics

Akosua Busia, George E. Dahl, Clara Fannjiang, David H. Alexander, Elizabeth Dorfman, Ryan Poplin, Cory Y. McLean, Pi-Chuan Chang, Mark DePristo

Motivation Inferring properties of biological sequences--such as determining the species-of-origin of a DNA sequence or the function of an amino-acid sequence--is a core task in many bioinformatics applications. These tasks are often solved using string-matching to map query sequences to labeled database sequences or via Hidden Markov Model-like pattern matching. In the current work we describe and assess a deep learning approach which trains a deep neural network (DNN) to predict database-derived labels directly from query sequences. Results We demonstrate this DNN performs at state-of-the-art or above levels on a difficult, practically important problem: predicting species-of-origin from short reads of 16S ribosomal DNA. When trained on 16S sequences of over 13,000 distinct species, our DNN achieves read-level species classification accuracy within 2.0% of perfect memorization of training data, and produces more accurate genus-level assignments for reads from held-out species than k -mer, alignment, and taxonomic binning baselines. Moreover, our models exhibit greater robustness than these existing approaches to increasing noise in the query sequences. Finally, we show that these DNNs perform well on experimental 16S mock community dataset. Overall, our results constitute a first step towards our long-term goal of developing a general-purpose deep learning approach to predicting meaningful labels from short biological sequences. Availability TensorFlow training code is available through GitHub (<https://github.com/tensorflow/models/tree/master/research>). Data in TensorFlow TFRecord format is available on Google Cloud Storage (gs://brain-genomics-public/research/seq2species/). Contact seq2species-interest{at}google.com Supplementary information Supplementary data are available in a separate document.

42: A Comparison of Methods: Normalizing High-Throughput RNA Sequencing Data
more details view paper

Posted to bioRxiv 03 Sep 2015

A Comparison of Methods: Normalizing High-Throughput RNA Sequencing Data
6,623 downloads bioinformatics

Rahul Reddy

As RNA-Seq and other high-throughput sequencing grow in use and remain critical for gene expression studies, technical variability in counts data impedes studies of differential expression studies, data across samples and experiments, or reproducing results. Studies like Dillies et al. (2013) compare several between-lane normalization methods involving scaling factors, while Hansen et al. (2012) and Risso et al. (2014) propose methods that correct for sample-specific bias or use sets of control genes to isolate and remove technical variability. This paper evaluates four normalization methods in terms of reducing intra-group, technical variability and facilitating differential expression analysis or other research where the biological, inter-group variability is of interest. To this end, the four methods were evaluated in differential expression analysis between data from Pickrell et al. (2010) and Montgomery et al. (2010) and between simulated data modeled on these two datasets. Though the between-lane scaling factor methods perform worse on real data sets, they are much stronger for simulated data. We cannot reject the recommendation of Dillies et al. to use TMM and DESeq normalization, but further study of power to detect effects of different size under each normalization method is merited.

43: Critical Assessment of Metagenome Interpretation − a benchmark of computational metagenomics software
more details view paper

Posted to bioRxiv 09 Jan 2017

Critical Assessment of Metagenome Interpretation − a benchmark of computational metagenomics software
6,490 downloads bioinformatics

Alexander Sczyrba, Peter Hofmann, Peter Belmann, David Koslicki, Stefan Janssen, Johannes Dröge, Ivan Gregor, Stephan Majda, Jessika Fiedler, Eik Dahms, Andreas Bremges, Adrian Fritz, Ruben Garrido-Oter, Tue Sparholt Jørgensen, Nicole Shapiro, Philip D Blood, Alexey Gurevich, Yang Bai, Dmitrij Turaev, Matthew Z. DeMaere, Rayan Chikhi, Niranjan Nagarajan, Christopher Quince, Fernando Meyer, Monika Balvoit, Lars Hestbjerg Hansen, Søren J. Sørensen, Burton K H Chia, Bertrand Denis, Jeff L Froula, Zhong Wang, Robert Egan, Dongwan Don Kang, Jeffrey J Cook, Charles Deltel, Michael Beckstette, Claire Lemaitre, Pierre Peterlongo, Guillaume Rizk, Dominique Lavenier, Yu-Wei Wu, Steven W Singer, Chirag Jain, Marc Strous, Heiner Klingenberg, Peter Meinicke, Michael Barton, Thomas Lingner, Hsin-Hung Lin, Yu-Chieh Liao, Genivaldo Gueiros Z. Silva, Daniel A Cuevas, Robert A Edwards, Surya Saha, Vitor C Piro, Bernhard Y Renard, Mihai Pop, Hans-Peter Klenk, Markus Göker, Nikos C. Kyrpides, Tanja Woyke, Julia A Vorholt, Paul Schulze-Lefert, Edward M Rubin, Aaron E Darling, Thomas Rattei, Alice McHardy

In metagenome analysis, computational methods for assembly, taxonomic profiling and binning are key components facilitating downstream biological data interpretation. However, a lack of consensus about benchmarking datasets and evaluation metrics complicates proper performance assessment. The Critical Assessment of Metagenome Interpretation (CAMI) challenge has engaged the global developer community to benchmark their programs on datasets of unprecedented complexity and realism. Benchmark metagenomes were generated from ~700 newly sequenced microorganisms and ~600 novel viruses and plasmids, including genomes with varying degrees of relatedness to each other and to publicly available ones and representing common experimental setups. Across all datasets, assembly and genome binning programs performed well for species represented by individual genomes, while performance was substantially affected by the presence of related strains. Taxonomic profiling and binning programs were proficient at high taxonomic ranks, with a notable performance decrease below the family level. Parameter settings substantially impacted performances, underscoring the importance of program reproducibility. While highlighting current challenges in computational metagenomics, the CAMI results provide a roadmap for software selection to answer specific research questions.

44: DeepGS: Predicting phenotypes from genotypes using Deep Learning
more details view paper

Posted to bioRxiv 31 Dec 2017

DeepGS: Predicting phenotypes from genotypes using Deep Learning
6,446 downloads bioinformatics

Wenlong Ma, Zhixu Qiu, Jie Song, Qian Cheng, Chuang Ma

Motivation: Genomic selection (GS) is a new breeding strategy by which the phenotypes of quantitative traits are usually predicted based on genome-wide markers of genotypes using conventional statistical models. However, the GS prediction models typically make strong assumptions and perform linear regression analysis, limiting their accuracies since they do not capture the complex, non-linear relationships within genotypes, and between genotypes and phenotypes. Results: We present a deep learning method, named DeepGS, to predict phenotypes from geno-types. Using a deep convolutional neural network, DeepGS uses hidden variables that jointly repre-sent features in genotypic markers when making predictions; it also employs convolution, sampling and dropout strategies to reduce the complexity of high-dimensional marker data. We used a large GS dataset to train DeepGS and compare its performance with other methods. In terms of mean normalized discounted cumulative gain value, DeepGS achieves an increase of 27.70%~246.34% over a conventional neural network in selecting top-ranked 1% individuals with high phenotypic values for the eight tested traits. Additionally, compared with the widely used method RR-BLUP, DeepGS still yields a relative improvement ranging from 1.44% to 65.24%. Through extensive simulation experiments, we also demonstrated the effectiveness and robustness of DeepGS for the absent of outlier individuals and subsets of genotypic markers. Finally, we illustrated the complementarity of DeepGS and RR-BLUP with an ensemble learning approach for further improving prediction performance. Availability: DeepGS is provided as an open source R package available at https://github.com/cma2015/DeepGS.

45: Phased Diploid Genome Assembly with Single Molecule Real-Time Sequencing
more details view paper

Posted to bioRxiv 03 Jun 2016

Phased Diploid Genome Assembly with Single Molecule Real-Time Sequencing
6,406 downloads bioinformatics

Chen-Shan Chin, Paul Peluso, Fritz J. Sedlazeck, Maria Nattestad, Gregory T. Concepcion, Alicia Clum, Christopher Dunn, Ronan O'Malley, Rosa Figueroa-Balderas, Abraham Morales-Cruz, Grant R. Cramer, Massimo Delledonne, Chongyuan Luo, Joseph R. Ecker, Dario Cantu, David R Rank, Michael C. Schatz

While genome assembly projects have been successful in a number of haploid or inbred species, one of the current main challenges is assembling non-inbred or rearranged heterozygous genomes. To address this critical need, we introduce the open-source FALCON and FALCON-Unzip algorithms (https://github.com/PacificBiosciences/FALCON/) to assemble Single Molecule Real-Time (SMRT(R)) Sequencing data into highly accurate, contiguous, and correctly phased diploid genomes. We demonstrate the quality of this approach by assembling new reference sequences for three heterozygous samples, including an F1 hybrid of the model species Arabidopsis thaliana, the widely cultivated V. vinifera cv. Cabernet Sauvignon, and the coral fungus Clavicorona pyxidata that have challenged short-read assembly approaches. The FALCON-based assemblies were substantially more contiguous and complete than alternate short or long-read approaches. The phased diploid assembly enabled the study of haplotype structures and heterozygosities between the homologous chromosomes, including identifying widespread heterozygous structural variations within the coding sequences.

46: Mash: fast genome and metagenome distance estimation using MinHash
more details view paper

Posted to bioRxiv 26 Oct 2015

Mash: fast genome and metagenome distance estimation using MinHash
6,359 downloads bioinformatics

Brian D. Ondov, Todd J. Treangen, Páll Melsted, Adam B. Mallonee, Nicholas H. Bergman, Sergey Koren, Adam M Phillippy

Mash extends the MinHash dimensionality-reduction technique to include a pairwise mutation distance and P-value significance test, enabling the efficient clustering and search of massive sequence collections. Mash reduces large sequences and sequence sets to small, representative sketches, from which global mutation distances can be rapidly estimated. We demonstrate several use cases, including the clustering of all 54,118 NCBI RefSeq genomes in 33 CPU hours; real-time database search using assembled or unassembled Illumina, Pacific Biosciences, and Oxford Nanopore data; and the scalable clustering of hundreds of metagenomic samples by composition. Mash is freely released under a BSD license (https://github.com/marbl/mash).

47: Hybrid assembly of the large and highly repetitive genome of Aegilops tauschii, a progenitor of bread wheat, with the mega-reads algorithm
more details view paper

Posted to bioRxiv 26 Jul 2016

Hybrid assembly of the large and highly repetitive genome of Aegilops tauschii, a progenitor of bread wheat, with the mega-reads algorithm
6,160 downloads bioinformatics

Aleksey V. Zimin, Daniela Puiu, Ming-Cheng Luo, Tingting Zhu, Sergey Koren, James A. Yorke, Jan Dvorak, Steven L Salzberg

Long sequencing reads generated by single-molecule sequencing technology offer the possibility of dramatically improving the contiguity of genome assemblies. The biggest challenge today is that long reads have relatively high error rates, currently around 15%. The high error rates make it difficult to use this data alone, particularly with highly repetitive plant genomes. Errors in the raw data can lead to insertion or deletion errors (indels) in the consensus genome sequence, which in turn create significant problems for downstream analysis; for example, a single indel may shift the reading frame and incorrectly truncate a protein sequence. Here we describe an algorithm that solves the high error rate problem by combining long, high-error reads with shorter but much more accurate Illumina sequencing reads, whose error rates average <1%. Our hybrid assembly algorithm combines these two types of reads to construct mega-reads, which are both long and accurate, and then assembles the mega-reads using the CABOG assembler, which was designed for long reads. We apply this technique to a large data set of Illumina and PacBio sequences from the species Aegilops tauschii, a large and highly repetitive plant genome that has resisted previous attempts at assembly. We show that the resulting assembled contigs are far larger than in any previous assembly, with an N50 contig size of 486,807. We compare the contigs to independently produced optical maps to evaluate their large-scale accuracy, and to a set of high-quality bacterial artificial chromosome (BAC)-based assemblies to evaluate base-level accuracy.

48: FMRIPrep: a robust preprocessing pipeline for functional MRI
more details view paper

Posted to bioRxiv 25 Apr 2018

FMRIPrep: a robust preprocessing pipeline for functional MRI
6,142 downloads bioinformatics

Oscar Esteban, Christopher Markiewicz, Ross W Blair, Craig Moodie, Ayse Ilkay Isik, Asier Erramuzpe Aliaga, James Kent, Mathias Goncalves, Elizabeth DuPre, Madeleine Snyder, Hiroyuki Oya, Satrajit Ghosh, Jessey Wright, Joke Durnez, Russell Poldrack, Krzysztof Gorgolewski

Preprocessing of functional MRI (fMRI) involves numerous steps to clean and standardize data before statistical analysis. Generally, researchers create ad hoc preprocessing workflows for each new dataset, building upon a large inventory of tools available for each step. The complexity of these workflows has snowballed with rapid advances in MR data acquisition and image processing techniques. We introduce fMRIPrep, an analysis-agnostic tool that addresses the challenge of robust and reproducible preprocessing for task-based and resting fMRI data. FMRIPrep automatically adapts a best-in-breed workflow to the idiosyncrasies of virtually any dataset, ensuring high-quality preprocessing with no manual intervention. By introducing visual assessment checkpoints into an iterative integration framework for software-testing, we show that fMRIPrep robustly produces high-quality results on a diverse fMRI data collection comprising participants from 54 different studies in the OpenfMRI repository. We review the distinctive features of fMRIPrep in a qualitative comparison to other preprocessing workflows. We demonstrate that fMRIPrep achieves higher spatial accuracy as it introduces less uncontrolled spatial smoothness than commonly used preprocessing tools. FMRIPrep has the potential to transform fMRI research by equipping neuroscientists with a high-quality, robust, easy-to-use and transparent preprocessing workflow which can help ensure the validity of inference and the interpretability of their results.

49: Text mining of 15 million full-text scientific articles
more details view paper

Posted to bioRxiv 11 Jul 2017

Text mining of 15 million full-text scientific articles
6,133 downloads bioinformatics

David Westergaard, Hans-Henrik Stærfeldt, Christian Tønsberg, Lars Juhl Jensen, Søren Brunak

Across academia and industry, text mining has become a popular strategy for keeping up with the rapid growth of the scientific literature. Text mining of the scientific literature has mostly been carried out on collections of abstracts, due to their availability. Here we present an analysis of 15 million English scientific full-text articles published during the period 1823-2016. We describe the development in article length and publication sub-topics during these nearly 250 years. We showcase the potential of text mining by extracting published protein-protein, disease-gene, and protein subcellular associations using a named entity recognition system, and quantitatively report on their accuracy using gold standard benchmark data sets. We subsequently compare the findings to corresponding results obtained on 16.5 million abstracts included in MEDLINE and show that text mining of full-text articles consistently outperforms using abstracts only.

50: Strain-resolved microbiome sequencing reveals mobile elements that drive bacterial competition on a clinical timescale
more details view paper

Posted to bioRxiv 07 Apr 2017

Strain-resolved microbiome sequencing reveals mobile elements that drive bacterial competition on a clinical timescale
6,113 downloads bioinformatics

Alex Bishara, Eli L Moss, Ekaterina Tkachenko, Joyce B Kang, Soumaya Zlitni, Rebecca N Culver, Tessa M. Andermann, Ziming Weng, Christina Wood, Christine Handy, Hanlee Ji, Serafim Batzoglou, Ami S Bhatt

Although shotgun short-read sequencing has facilitated the study of strain-level architecture within complex microbial communities, existing metagenomic approaches often cannot capture structural differences between closely related co-occurring strains. Recent methods, which employ read cloud sequencing and specialized assembly techniques, provide significantly improved genome drafts and show potential to capture these strain-level differences. Here, we apply this read cloud metagenomic approach to longitudinal stool samples from a patient undergoing hematopoietic cell transplantation. The patient's microbiome is profoundly disrupted and is eventually dominated by Bacteroides caccae. Comparative analysis of B. caccae genomes obtained using read cloud sequencing together with metagenomic RNA sequencing allows us to predict that particular mobile element integrations result in increased antibiotic resistance, which we further support using in vitro antibiotic susceptibility testing. Thus, we find read cloud sequencing to be useful in identifying strain-level differences that underlie differential fitness.

51: Streaming algorithms for identification of pathogens and antibiotic resistance potential from real-time MinION sequencing
more details view paper

Posted to bioRxiv 15 May 2015

Streaming algorithms for identification of pathogens and antibiotic resistance potential from real-time MinION sequencing
6,014 downloads bioinformatics

Minh Duc Cao, Devika Ganesamoorthy, Alysha G Elliott, Huihui Zhang, Matthew A. Cooper, Lachlan Coin

The recently introduced Oxford Nanopore MinION platform generates DNA sequence data in real-time. This opens immense potential to shorten the sample-to-results time and is likely to lead to enormous benefits in rapid diagnosis of bacterial infection and identification of drug resistance. However, there are very few tools available for streaming analysis of real-time sequencing data. Here, we present a framework for streaming analysis of MinION real-time sequence data, together with probabilistic streaming algorithms for species typing, multi-locus strain typing, gene presence strain-typing and antibiotic resistance profile identification. Using three culture isolate samples as well as a mixed-species sample, we demonstrate that bacterial species and strain information can be obtained within 30 minutes of sequencing and using about 500 reads, initial drug-resistance profiles within two hours, and complete resistance profiles within 10 hours. Multi-locus strain typing required more than 15x coverage to generate confident assignments, whereas gene-presence typing could detect the presence of a known strain with 0.5x coverage. We also show that our pipeline can process over 100 times more data than the current throughput of the MinION on a desktop computer.

52: Identification of transcriptional signatures for cell types from single-cell RNA-Seq
more details view paper

Posted to bioRxiv 01 Feb 2018

Identification of transcriptional signatures for cell types from single-cell RNA-Seq
5,974 downloads bioinformatics

Vasilis Ntranos, Lynn Yi, Páll Melsted, Lior Pachter

Single-cell RNA-Seq makes it possible to characterize the transcriptomes of cell types and identify their transcriptional signatures via differential analysis. We present a fast and accurate method for discriminating cell types that takes advantage of the large numbers of cells that are assayed. When applied to transcript compatibility counts obtained via pseudoalignment, our approach provides a quantification-free analysis of 3' single-cell RNA-Seq that can identify previously undetectable marker genes.

53: Interaction of quercetin with transcriptional regulator LasR of Pseudomonas aeruginosa: Mechanistic insights of the inhibition of virulence through quorum sensing
more details view paper

Posted to bioRxiv 27 Dec 2017

Interaction of quercetin with transcriptional regulator LasR of Pseudomonas aeruginosa: Mechanistic insights of the inhibition of virulence through quorum sensing
5,967 downloads bioinformatics

Hovakim Grabski, Lernik Hunanyan, Susanna Tiratsuyan, Hrachik Vardapetyan

Pseudomonas aeruginosa is one of the most dangerous superbugs in the list of bacteria for which new antibiotics are urgently needed, which was published by World Health Organization. P. aeruginosa is an antibiotic-resistant opportunistic human pathogen. It affects patients with AIDS, cystic fibrosis, cancer, burn victims and people with prosthetics and implants. P. aeruginosa also forms biofilms. Biofilms increase resistance to antibiotics and host immune responses. Because of biofilms, current therapies are not effective. It is important to find new antibacterial treatment strategies against P. aeruginosa. Biofilm formation is regulated through a system called quorum sensing. Thus disrupting this system is considered a promising strategy to combat bacterial pathogenicity. It is known that quercetin inhibits Pseudomonas aeruginosa biofilm formation, but the mechanism of action is unknown. In the present study, we tried to analyse the mode of interactions of LasR with quercetin. We used a combination of molecular docking, molecular dynamics (MD) simulations and machine learning techniques for the study of the interaction of the LasR protein of P. aeruginosa with quercetin. We assessed the conformational changes of the interaction and analysed the molecular details of the binding of quercetin with LasR. We show that quercetin has two binding modes. One binding mode is the interaction with ligand binding domain, this interaction is not competitive and it has also been shown experimentally. The second binding mode is the interaction with the bridge, it involves conservative amino acid interactions from LBD, SLR, and DBD and it is also not competitive. Experimental studies show hydroxyl group of ring A is necessary for inhibitory activity, in our model the hydroxyl group interacts with Leu177 during the second binding mode. This could explain the molecular mechanism of how quercetin inhibits LasR protein. This study may offer insights on how quercetin inhibits quorum sensing circuitry by interacting with transcriptional regulator LasR. The capability of having two binding modes may explain why quercetin is effective at inhibiting biofilm formation and virulence gene expression.

54: Integrating Hi-C links with assembly graphs for chromosome-scale assembly
more details view paper

Posted to bioRxiv 07 Feb 2018

Integrating Hi-C links with assembly graphs for chromosome-scale assembly
5,950 downloads bioinformatics

Jay Ghurye, Arang Rhie, Brian P. Walenz, Anthony Schmitt, Siddarth Selvaraj, Mihai Pop, Adam M Phillippy, Sergey Koren

Motivation: Long-read sequencing and novel long-range assays have revolutionized de novo genome assembly by automating the reconstruction of reference-quality genomes. In particular, Hi-C sequencing is becoming an economical method for generating chromosome-scale scaffolds. Despite its increasing popularity, there are limited open-source tools available. Errors, particularly inversions and fusions across chromosomes, remain higher than alternate scaffolding technologies. Results: We present a novel open-source Hi-C scaffolder that does not require an a priori estimate of chromosome number and minimizes errors by scaffolding with the assistance of an assembly graph. We demonstrate higher accuracy than the state-of-the-art methods across a variety of Hi-C library preparations and input assembly sizes. Availability and Implementation: The Python and C++ code for our method is openly available at https://github.com/machinegun/SALSA.

55: Graph abstraction reconciles clustering with trajectory inference through a topology preserving map of single cells
more details view paper

Posted to bioRxiv 25 Oct 2017

Graph abstraction reconciles clustering with trajectory inference through a topology preserving map of single cells
5,949 downloads bioinformatics

F. Alexander Wolf, Fiona Hamey, Mireya Plass, Jordi Solana, Joakim S. Dahlin, Berthold Göttgens, Nikolaus Rajewsky, Lukas Simon, Fabian J. Theis

Single-cell RNA-seq allows quantification of biological heterogeneity across both discrete cell types and continuous cell differentiation transitions. We present approximate graph abstraction (AGA), an algorithm that reconciles the computational analysis strategies of clustering and trajectory inference by explaining cell-to-cell variation both in terms of discrete and continuous latent variables (https://github.com/theislab/graph_abstraction). This enables to generate cellular maps of differentiation manifolds with complex topologies --- efficiently and robustly across different datasets. Approximate graph abstraction quantifies the connectivity of partitions of a neighborhood graph of single cells, thereby generating a much simpler abstracted graph whose nodes label the partitions. Together with a random walk-based distance measure, this generates a topology preserving map of single cells --- a partial coordinatization of data useful for exploring and explaining its variation. We use the abstracted graph to assess which subsets of data are better explained by discrete clusters than by a continuous variable, to trace gene expression changes along aggregated single-cell paths through data and to infer abstracted trees that best explain the global topology of data. We demonstrate the power of the method by reconstructing differentiation processes with high numbers of branchings from single-cell gene expression datasets and by identifying biological trajectories from single-cell imaging data using a deep-learning based distance metric. Along with the method, we introduce measures for the connectivity of graph partitions, generalize random-walk based distance measures to disconnected graphs and introduce a path-based measure for topological similarity between graphs. Graph abstraction is computationally efficient and provides speedups of at least 30 times when compared to algorithms for the inference of lineage trees.

56: Comparing Variant Call Files for Performance Benchmarking of Next-Generation Sequencing Variant Calling Pipelines
more details view paper

Posted to bioRxiv 02 Aug 2015

Comparing Variant Call Files for Performance Benchmarking of Next-Generation Sequencing Variant Calling Pipelines
5,928 downloads bioinformatics

John G Cleary, Ross Braithwaite, Kurt Gaastra, Brian S Hilbush, Stuart Inglis, Sean A Irvine, Alan Jackson, Richard Littin, Mehul Rathod, David Ware, Justin M Zook, Len Trigg, Francisco M. De La Vega

To evaluate and compare the performance of variant calling methods and their confidence scores, comparisons between a test call set and a ?gold standard? need to be carried out. Unfortunately, these comparisons are not straightforward with the current Variant Call Files (VCF), which are the standard output of most variant calling algorithms for high-throughput sequencing data. Comparisons of VCFs are often confounded by the different representations of indels, MNPs, and combinations thereof with SNVs in complex regions of the genome, resulting in misleading results. A variant caller is inherently a classification method designed to score putative variants with confidence scores that could permit controlling the rate of false positives (FP) or false negatives (FN) for a given application. Receiver operator curves (ROC) and the area under the ROC (AUC) are efficient metrics to evaluate a test call set versus a gold standard. However, in the case of VCF data this also requires a special accounting to deal with discrepant representations. We developed a novel algorithm for comparing variant call sets that deals with complex call representation discrepancies and through a dynamic programing method that minimizes false positives and negatives globally across the entire call sets for accurate performance evaluation of VCFs.

57: STAR-Fusion: Fast and Accurate Fusion Transcript Detection from RNA-Seq
more details view paper

Posted to bioRxiv 24 Mar 2017

STAR-Fusion: Fast and Accurate Fusion Transcript Detection from RNA-Seq
5,882 downloads bioinformatics

Brian Haas, Alex Dobin, Nicolas Stransky, Bo Li, Xiao Yang, Timothy Tickle, Asma Bankapur, Carrie Ganote, Thomas G. Doak, Nathalie Pochet, Jing Sun, Catherine J Wu, Thomas R Gingeras, Aviv Regev

Motivation: Fusion genes created by genomic rearrangements can be potent drivers of tumorigenesis. However, accurate identification of functionally fusion genes from genomic sequencing requires whole genome sequencing, since exonic sequencing alone is often insufficient. Transcriptome sequencing provides a direct, highly effective alternative for capturing molecular evidence of expressed fusions in the precision medicine pipeline, but current methods tend to be inefficient or insufficiently accurate, lacking in sensitivity or predicting large numbers of false positives. Here, we describe STAR-Fusion, a method that is both fast and accurate in identifying fusion transcripts from RNA-Seq data. Results: We benchmarked STAR-Fusion's fusion detection accuracy using both simulated and genuine Illumina paired-end RNA-Seq data, and show that it has superior performance compared to popular alternative fusion detection methods. Availability and implementation: STAR-Fusion is implemented in Perl, freely available as open source software at http://star-fusion.github.io, and supported on Linux.

58: Accurate prediction of single-cell DNA methylation states using deep learning
more details view paper

Posted to bioRxiv 27 May 2016

Accurate prediction of single-cell DNA methylation states using deep learning
5,855 downloads bioinformatics

Christof Angermueller, Heather Lee, Wolf Reik, Oliver Stegle

Recent technological advances have enabled assaying DNA methylation at single-cell resolution. Current protocols are limited by incomplete CpG coverage and hence methods to predict missing methylation states are critical to enable genome-wide analyses. Here, we report DeepCpG, a computational approach based on deep neural networks to predict DNA methylation states from DNA sequence and incomplete methylation profiles in single cells. We evaluated DeepCpG on single-cell methylation data from five cell types generated using alternative sequencing protocols, finding that DeepCpG yields substantially more accurate predictions than previous methods. Additionally, we show that the parameters of our model can be interpreted, thereby providing insights into the effect of sequence composition on methylation variability.

59: Rapid and efficient analysis of 20,000 RNA-seq samples with Toil
more details view paper

Posted to bioRxiv 07 Jul 2016

Rapid and efficient analysis of 20,000 RNA-seq samples with Toil
5,741 downloads bioinformatics

John Vivian, Arjun Rao, Frank Austin Nothaft, Christopher Ketchum, Joel Armstrong, Adam Novak, Jacob Pfeil, Jake Narkizian, Alden D. Deran, Audrey Musselman-Brown, Hannes Schmidt, Peter Amstutz, Brian Craft, Mary Goldman, Kate Rosenbloom, Melissa Cline, Brian O’Connor, Megan Hanna, Chet Birger, W. James Kent, David A. Patterson, Anthony D. Joseph, Jingchun Zhu, Sasha Zaranek, Gad Getz, David Haussler, Benedict Paten

Toil is portable, open-source workflow software that supports contemporary workflow definition languages and can be used to securely and reproducibly run scientific workflows efficiently at large-scale. To demonstrate Toil, we processed over 20,000 RNA-seq samples to create a consistent meta-analysis of five datasets free of computational batch effects that we make freely available. Nearly all the samples were analysed in under four days using a commercial cloud cluster of 32,000 preemptable cores.

60: Deep Learning based multi-omics integration robustly predicts survival in liver cancer
more details view paper

Posted to bioRxiv 08 Mar 2017

Deep Learning based multi-omics integration robustly predicts survival in liver cancer
5,647 downloads bioinformatics

Kumardeep Chaudhary, Olivier B. Poirion, Liangqun Lu, Lana X. Garmire

Identifying robust survival subgroups of hepatocellular carcinoma (HCC) will significantly improve patient care. Currently, endeavor of integrating multi-omics data to explicitly predict HCC survival from multiple patient cohorts is lacking. To fill in this gap, we present a deep learning (DL) based model on HCC that robustly differentiates survival subpopulations of patients in six cohorts. We build the DL based, survival-sensitive model on 360 HCC patients' data using RNA-seq, miRNA-seq and methylation data from TCGA, which predicts prognosis as good as an alternative model where genomics and clinical data are both considered. This DL based model provides two optimal subgroups of patients with significant survival differences (P=7.13e-6) and good model fitness (C-index=0.68). More aggressive subtype is associated with frequent TP53 inactivation mutations, higher expression of stemness markers (KRT19, EPCAM) and tumor marker BIRC5, and activated Wnt and Akt signaling pathways. We validated this multi-omics model on five external datasets of various omics types: LIRI-JP cohort (n=230, C-index=0.75), NCI cohort (n=221, C-index=0.67), Chinese cohort (n=166, C-index=0.69), E-TABM-36 cohort (n=40, C-index=0.77), and Hawaiian cohort (n=27, C-index=0.82). This is the first study to employ deep learning to identify multi-omics features linked to the differential survival of HCC patients. Given its robustness over multiple cohorts, we expect this workflow to be useful at predicting HCC prognosis prediction.

Previous page 1 2 3 4 5 6 7 . . . 307 Next page

Sign up for the Rxivist weekly newsletter! (Click here for more details.)


News