Rxivist logo

Rxivist.org combines preprints from bioRxiv.org with data from Twitter to help you find the papers being discussed in your field.
Currently indexing 62,945 bioRxiv papers from 279,208 authors.

Most downloaded bioRxiv papers, all time

Results 1 through 20 out of 6152

in category bioinformatics


1: Opportunities And Obstacles For Deep Learning In Biology And Medicine

Travers Ching, Daniel S Himmelstein et al.

50,811 downloads (posted 28 May 2017)

Deep learning, which describes a class of machine learning algorithms, has recently showed impressive results across a variety of domains. Biology and medicine are data rich, but the data are complex and often ill-understood. Problems of this nature may be particularly well-suited to deep learning techniques. We examine applications of deep learning to a variety of biomedical problems - patient classification, fundamental biological processes, and treatment of patients - and discuss whether deep learning will transform these tasks or if the biomedical sphere poses unique challenges. We find that deep learning has yet to revolutionize or definitively resolve any of these problems, but promising advances have been made on the prior state of the art. Even when improvement over a previous baseline has been modest, we have seen signs that deep learning methods may speed or aid human investigation. More work is needed to address concerns related to interpretability and how to best model each problem. Furthermore, the limited amount of labeled data for training presents problems in some domains, as do legal and privacy constraints on work with sensitive health records. Nonetheless, we foresee deep learning powering changes at both bench and bedside with the potential to transform several areas of biology and medicine.


2: Third-generation sequencing and the future of genomics

Hayan Lee, James Gurtowski et al.

28,024 downloads (posted 13 Apr 2016)

Third-generation long-range DNA sequencing and mapping technologies are creating a renaissance in high-quality genome sequencing. Unlike second-generation sequencing, which produces short reads a few hundred base-pairs long, third-generation single-molecule technologies generate over 10,000 bp reads or map over 100,000 bp molecules. We analyze how increased read lengths can be used to address long-standing problems in de novo genome assembly, structural variation analysis and haplotype phasing.


3: Salmon provides accurate, fast, and bias-aware transcript expression estimates using dual-phase inference

Rob Patro, Geet Duggal et al.

18,758 downloads (posted 27 Jun 2015)

We introduce Salmon, a new method for quantifying transcript abundance from RNA-seq reads that is highly-accurate and very fast. Salmon is the first transcriptome-wide quantifier to model and correct for fragment GC content bias, which we demonstrate substantially improves the accuracy of abundance estimates and the reliability of subsequent differential expression analysis compared to existing methods that do not account for these biases. Salmon achieves its speed and accuracy by combining a new dual-phase parallel inf...


4: Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2

Michael I Love, Wolfgang Huber et al.

17,900 downloads (posted 19 Feb 2014)

In comparative high-throughput sequencing assays, a fundamental task is the analysis of count data, such as read counts per gene in RNA-seq data, for evidence of systematic changes across experimental conditions. Small replicate numbers, discreteness, large dynamic range and the presence of outliers require a suitable statistical approach. We present DESeq2, a method for differential analysis of count data. DESeq2 uses shrinkage estimation for dispersions and fold changes to improve stability and interpretability of the...


5: Moving beyond P values: Everyday data analysis with estimation plots

Joses Ho, Tayfun Tumkaya et al.

16,169 downloads (posted 26 Jul 2018)

Over the past 75 years, a number of statisticians have advised that the data-analysis method known as null-hypothesis significance testing (NHST) should be deprecated (Berkson, 1942; Halsey et al., 2015). The limitations of NHST have been extensively discussed, with an emerging consensus that current statistical practice in the biological sciences needs reform. However, there is less agreement on the specific nature of reform, with vigorous debate surrounding what would constitute a suitable alternative (Altman et al., ...


6: Content-Aware Image Restoration: Pushing the Limits of Fluorescence Microscopy

Martin Weigert, Deborah Schmidt et al.

15,387 downloads (posted 19 Dec 2017)

Fluorescence microscopy is a key driver of discoveries in the life-sciences, with observable phenomena being limited by the optics of the microscope, the chemistry of the fluorophores, and the maximum photon exposure tolerated by the sample. These limits necessitate trade-offs between imaging speed, spatial resolution, light exposure, and imaging depth. In this work we show how image restoration based on deep learning extends the range of biological phenomena observable by microscopy. On seven concrete examples we demon...


7: End-to-end differentiable learning of protein structure

Mohammed AlQuraishi

14,993 downloads (posted 14 Feb 2018)

Accurate prediction of protein structure is one of the central challenges of biochemistry. Despite significant progress made by co-evolution methods to predict protein structure from signatures of residue-residue coupling found in the evolutionary record, a direct and explicit mapping between protein sequence and structure remains elusive, with no substantial recent progress. Meanwhile, rapid developments in deep learning, which have found remarkable success in computer vision, natural language processing, and quantum c...


8: Evaluation of UMAP as an alternative to t-SNE for single-cell data

Etienne Becht, Charles-Antoine Dutertre et al.

13,730 downloads (posted 10 Apr 2018)

Uniform Manifold Approximation and Projection (UMAP) is a recently-published non-linear dimensionality reduction technique. Another such algorithm, t-SNE, has been the default method for such task in the past years. Herein we comment on the usefulness of UMAP high-dimensional cytometry and single-cell RNA sequencing, notably highlighting faster runtime and consistency, meaningful organization of cell clusters and preservation of continuums in UMAP compared to t-SNE.


9: Flexible analysis of transcriptome assemblies with Ballgown

Alyssa C Frazee, Geo Pertea et al.

13,360 downloads (posted 30 Mar 2014)

We have built a statistical package called Ballgown for estimating differential expression of genes, transcripts, or exons from RNA sequencing experiments. Ballgown is designed to work with the popular Cufflinks transcript assembly software and uses well-motivated statistical methods to provide estimates of changes in expression. It permits statistical analysis at the transcript level for a wide variety of experimental designs, allows adjustment for confounders, and handles studies with continuous covariates. Ballgown p...


10: A comparison of single-cell trajectory inference methods: towards more accurate and robust tools

Wouter Saelens, Robrecht Cannoodt et al.

13,078 downloads (posted 05 Mar 2018)

Using single-cell -omics data, it is now possible to computationally order cells along trajectories, allowing the unbiased study of cellular dynamic processes. Since 2014, more than 50 trajectory inference methods have been developed, each with its own set of methodological characteristics. As a result, choosing a method to infer trajectories is often challenging, since a comprehensive assessment of the performance and robustness of each method is still lacking. In order to facilitate the comparison of the results of th...


11: MAGIC: A diffusion-based imputation method reveals gene-gene interactions in single-cell RNA-sequencing data

David van Dijk, Juozas Nainys et al.

13,050 downloads (posted 25 Feb 2017)

Single-cell RNA-sequencing is fast becoming a major technology that is revolutionizing biological discovery in fields such as development, immunology and cancer. The ability to simultaneously measure thousands of genes at single cell resolution allows, among other prospects, for the possibility of learning gene regulatory networks at large scales. However, scRNA-seq technologies suffer from many sources of significant technical noise, the most prominent of which is dropout due to inefficient mRNA capture. This results i...


12: DeepAD: Alzheimer′s Disease Classification via Deep Convolutional Neural Networks using MRI and fMRI

Saman Sarraf, Danielle D. DeSouza et al.

12,813 downloads (posted 21 Aug 2016)

To extract patterns from neuroimaging data, various statistical methods and machine learning algorithms have been explored for the diagnosis of Alzheimer′s disease among older adults in both clinical and research applications; however, distinguishing between Alzheimer′s and healthy brain data has been challenging in older adults (age > 75) due to highly similar patterns of brain atrophy and image intensities. Recently, cutting-edge deep learning technologies have rapidly expanded into numerous fields, including medical ...


13: Ancestry Composition: A Novel, Efficient Pipeline for Ancestry Deconvolution

Eric Y Durand, Chuong B Do et al.

12,270 downloads (posted 18 Oct 2014)

Ancestry deconvolution, the task of identifying the ancestral origin of chromosomal segments in admixed individuals, has important implications, from mapping disease genes to identifying candidate loci under natural selection. To date, however, most existing methods for ancestry deconvolution are typically limited to two or three ancestral populations, and cannot resolve contributions from populations related at a sub-continental scale. We describe Ancestry Composition, a modular three-stage pipeline that efficiently an...


14: Visualizing Structure and Transitions for Biological Data Exploration

Kevin R Moon, David van Dijk et al.

11,193 downloads (posted 24 Mar 2017)

With the advent of high-throughput technologies measuring high-dimensional biological data, there is a pressing need for visualization tools that reveal the structure and emergent patterns of data in an intuitive form. We present PHATE, a visualization method that captures both local and global nonlinear structure in data by an information-geometric distance between datapoints. We perform extensive comparison between PHATE and other tools on a variety of artificial and biological datasets, and find that it consistently ...


15: Privacy-preserving generative deep neural networks support clinical data sharing

Brett K Beaulieu-Jones, Zhiwei Steven Wu et al.

11,117 downloads (posted 05 Jul 2017)

Background: Data sharing accelerates scientific progress but sharing individual level data while preserving patient privacy presents a barrier. Methods and Results: Using pairs of deep neural networks, we generated simulated, synthetic "participants" that closely resemble participants of the SPRINT trial. We showed that such paired networks can be trained with differential privacy, a formal privacy framework that limits the likelihood that queries of the synthetic participants' data could identify a real a participant i...


16: Assembling Large Genomes with Single-Molecule Sequencing and Locality Sensitive Hashing

Konstantin Berlin, Sergey Koren et al.

10,815 downloads (posted 14 Aug 2014)

We report reference-grade de novo assemblies of four model organisms and the human genome from single-molecule, real-time (SMRT) sequencing. Long-read SMRT sequencing is routinely used to finish microbial genomes, but the available assembly methods have not scaled well to larger genomes. Here we introduce the MinHash Alignment Process (MHAP) for efficient overlapping of noisy, long reads using probabilistic, locality-sensitive hashing. Together with Celera Assembler, MHAP was used to reconstruct the genomes of Escherich...


17: Reconstruction of developmental landscapes by optimal-transport analysis of single-cell gene expression sheds light on cellular reprogramming.

Geoffrey Schiebinger, Jian Shu et al.

10,471 downloads (posted 27 Sep 2017)

Understanding the molecular programs that guide cellular differentiation during development is a major goal of modern biology. Here, we introduce an approach, WADDINGTON-OT, based on the mathematics of optimal transport, for inferring developmental landscapes, probabilistic cellular fates and dynamic trajectories from large-scale single-cell RNA-seq (scRNA-seq) data collected along a time course. We demonstrate the power of WADDINGTON-OT by applying the approach to study 65,781 scRNA-seq profiles collected at 10 time po...


18: RapMap: A Rapid, Sensitive and Accurate Tool for Mapping RNA-seq Reads to Transcriptomes

Avi Srivastava, Hirak Sarkar et al.

10,416 downloads (posted 22 Oct 2015)

Motivation: The alignment of sequencing reads to a transcriptome is a common and important step in many RNA-seq analysis tasks. When aligning RNA-seq reads directly to a transcriptome (as is common in the de novo setting or when a trusted reference annotation is available), care must be taken to report the potentially large number of multi-mapping locations per read. This can pose a substantial computational burden for existing aligners, and can considerably slow downstream analysis. Results: We introduce a novel concep...


19: qqman: an R package for visualizing GWAS results using Q-Q and manhattan plots

Stephen D. Turner

10,243 downloads (posted 14 May 2014)

Summary: Genome-wide association studies (GWAS) have identified thousands of human trait-associated single nucleotide polymorphisms. Here, I describe a freely available R package for visualizing GWAS results using Q-Q and manhattan plots. The qqman package enables the flexible creation of manhattan plots, both genome-wide and for single chromosomes, with optional highlighting of SNPs of interest. Availability: qqman is released under the GNU General Public License, and is freely available on the Comprehensive R Archive ...


20: A complete bacterial genome assembled de novo using only nanopore sequencing data

Nicholas J Loman, Joshua Quick et al.

10,012 downloads (posted 20 Feb 2015)

A method for de novo assembly of data from the Oxford Nanopore MinION instrument is presented which is able to reconstruct the sequence of an entire bacterial chromosome in a single contig. Initially, overlaps between nanopore reads are detected. Reads are then subjected to one or more rounds of error correction by a multiple alignment process employing partial order graphs. After correction, reads are assembled using the Celera assembler. Finally, the assembly is polished using signal-level data from the nanopore emplo...