Rxivist combines preprints from bioRxiv with data from Twitter to help you find the papers being discussed in your field. Currently indexing 62,198 bioRxiv papers from 276,129 authors.
Most downloaded bioRxiv papers, all time
in category bioinformatics
6,100 results found.
9,740 downloads bioinformatics
Monitoring the progress of DNA molecules through a membrane pore has been postulated as a method for sequencing DNA for several decades. Recently, a nanopore-based sequencing instrument, the Oxford Nanopore MinION, has become available, which we used to sequence the S. cerevisiae genome. To make use of these data, we developed a novel open-source hybrid error correction algorithm, Nanocorr (https://github.com/jgurtowski/nanocorr), specifically for Oxford Nanopore reads, as existing packages were incapable of assembling such long read lengths (5-50 kbp) at such high error rates (between ~5% and 40%). With this new method we were able to perform a hybrid error correction of the nanopore reads using complementary MiSeq data and produce a de novo assembly that is highly contiguous and accurate: the contig N50 length is more than ten times greater than that of an Illumina-only assembly (678 kbp versus 59.9 kbp), and the assembly has greater than 99.88% consensus identity when compared to the reference. Furthermore, the assembly with the long nanopore reads presents a much more complete representation of the features of the genome and correctly assembles gene cassettes, rRNAs, transposable elements, and other genomic features that were almost entirely absent in the Illumina-only assembly.
9,627 downloads bioinformatics
Alvaro N Barbeira, Scott P. Dickinson, Jason M Torres, Jiamao Zheng, Eric S. Torstenson, Heather E Wheeler, Kaanan P. Shah, Rodrigo Bonazzola, Tzintzuni Garcia, Todd Edwards, GTEx Consortium, Dan L Nicolae, Nancy J. Cox, Hae Kyung Im
Scalable, integrative methods to understand mechanisms that link genetic variants with phenotypes are needed. Here we derive a mathematical expression to compute PrediXcan (a gene mapping approach) results using summary data (S-PrediXcan) and show its accuracy and general robustness to misspecified reference sets. We apply this framework to 44 GTEx tissues and 100+ phenotypes from GWAS and meta-analysis studies, creating a growing public catalog of associations that seeks to capture the effects of gene expression variation on human phenotypes. Replication in an independent cohort is shown. Most of the associations were tissue specific, suggesting context specificity of the trait etiology. Colocalized significant associations in unexpected tissues underscore the need for an agnostic scanning of multiple contexts to improve our ability to detect causal regulatory mechanisms. Monogenic disease genes are enriched among significant associations for related traits, suggesting that smaller alterations of these genes may cause a spectrum of milder phenotypes.
8,948 downloads bioinformatics
Single-cell RNA-seq technologies enable high throughput gene expression measurement of individual cells, and allow the discovery of heterogeneity within cell populations. Measurement of cell-to-cell gene expression similarity is critical to identification, visualization and analysis of cell populations. However, single-cell data introduce challenges to conventional measures of gene expression similarity because of the high level of noise, outliers and dropouts. Here, we propose a novel similarity-learning framework, SIMLR (single-cell interpretation via multi-kernel learning), which learns an appropriate distance metric from the data for dimension reduction, clustering and visualization applications. Benchmarking against state-of-the-art methods for these applications, we used SIMLR to re-analyse seven representative single-cell data sets, including high-throughput droplet-based data sets with tens of thousands of cells. We show that SIMLR greatly improves clustering sensitivity and accuracy, as well as the visualization and interpretability of the data.
8,946 downloads bioinformatics
In 2017, the World Health Organization announced its list of the most dangerous superbugs; among them is Pseudomonas aeruginosa, an antibiotic-resistant opportunistic human pathogen and one of the "ESKAPE" pathogens. The central problem is that it affects patients suffering from AIDS, cystic fibrosis, cancer, burns and other conditions. P. aeruginosa creates and inhabits surface-associated biofilms. Biofilms increase resistance to antibiotics and host immune responses, and because of this, current treatments are not effective. It is imperative to find new antibacterial treatment strategies against P. aeruginosa, but the detailed molecular properties of the LasR protein are not clearly known to date. In the present study, we analysed the molecular properties of the LasR protein as well as the mode of its interactions with the autoinducer (AI) N-3-oxododecanoyl homoserine lactone (3-O-C12-HSL). We performed docking and molecular dynamics (MD) simulations of the LasR protein of P. aeruginosa with the 3-O-C12-HSL ligand. We assessed the conformational changes of the interaction and analysed the molecular details of the binding of 3-O-C12-HSL to LasR. A new interaction site of 3-O-C12-HSL with the LasR protein was found, which involves interaction with conserved residues from the ligand binding domain (LBD), beta turns in the short linker region (SLR) and the DNA binding domain (DBD). We refer to it as the LBD-SLR-DBD bridge interaction, or "the bridge". We have also performed LasR monomer protein docking and found a new form of dimerization. This study may offer new insights for future experimental studies to detect the interaction of the autoinducer with "the bridge" of the LasR protein, and a new interaction site for drug design.
8,487 downloads bioinformatics
Background: Next generation sequencing (NGS) is a widely used technology in both basic research and clinical settings and it will continue to have a major impact on biomedical sciences. However, the use of incorrect normalization methods can lead to systematic biases and spurious results, making the selection of an appropriate normalization strategy a crucial and often overlooked part of NGS analysis. Results: We present a basic introduction to the currently available normalization methods for differential expression and ChIP-seq applications, along with best use recommendations for different experimental techniques and datasets. We demonstrate that the choice of normalization technique can have a significant impact on the number of genes called as differentially expressed in an RNA-seq experiment or peaks called in a ChIP-seq experiment. Conclusions: The choice of the most adequate normalization method depends on both the distribution of signal in the dataset and the intended downstream applications. Depending on the design and purpose of the study, appropriate bias correction should also be considered.
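To make the normalization discussion concrete, here is a minimal sketch of one of the simplest strategies for sequencing count data: library-size scaling to counts per million (CPM). The sample and gene names are invented for illustration; this is not the paper's own code, and real pipelines layer more sophisticated corrections (e.g. TMM, quantile normalization) on top.

```python
# Minimal sketch of library-size (CPM) normalization for raw read counts.
# Sample/gene names are illustrative only.

def cpm_normalize(counts):
    """Scale each sample's raw counts to counts per million (CPM)."""
    normalized = {}
    for sample, gene_counts in counts.items():
        library_size = sum(gene_counts.values())
        normalized[sample] = {
            gene: count / library_size * 1_000_000
            for gene, count in gene_counts.items()
        }
    return normalized

raw = {
    "sampleA": {"geneX": 500, "geneY": 1500},
    "sampleB": {"geneX": 100, "geneY": 900},
}
norm = cpm_normalize(raw)
print(norm["sampleA"]["geneX"])  # 250000.0
```

Because the two samples have different library sizes (2,000 vs 1,000 reads), identical raw counts would otherwise be incomparable between them; CPM removes exactly that depth effect and nothing else, which is why the abstract stresses matching the method to the dataset.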
8,086 downloads bioinformatics
The human gut microbiome is a complex ecosystem that both affects and is affected by its host's status. Previous analyses of gut microflora revealed associations between specific microbes and host health and disease status, genotype and diet. Here, we developed a method of predicting the biological age of the host based on the microbiological profiles of gut microbiota, using a curated dataset of 1,165 healthy individuals (1,663 microbiome samples). Our predictive model, a human microbiome clock, has the architecture of a deep neural network and achieves an accuracy of 3.94 years mean absolute error in cross-validation. The performance of the deep microbiome clock was also evaluated on several additional populations. We further introduce a platform for biological interpretation of individual microbial features used in age models, which relies on permutation feature importance and accumulated local effects. This approach has allowed us to define a list of 95 intestinal biomarkers of human aging. We further show that this list can be reduced to 39 taxa that convey the most information on their host's aging. Overall, we show that (a) microbiological profiles can be used to predict human age; and (b) microbial features selected by models are age-related.
7,765 downloads bioinformatics
Systems biology approaches, leveraging multi-omics measurements, are needed to capture the complexity of biological networks while identifying the key molecular drivers of disease mechanisms. We present DIABLO, a novel integrative method to identify multi-omics biomarker panels that can discriminate between multiple phenotypic groups. In multi-omics analyses of simulated and real-world datasets, DIABLO resulted in superior biological enrichment compared to other integrative methods, and achieved comparable predictive performance to existing multi-step classification schemes. DIABLO is a versatile approach that will benefit a diverse range of research areas where multiple high-dimensional datasets are available for the same set of specimens. DIABLO is implemented along with tools for model selection and validation, as well as graphical outputs to assist in the interpretation of these integrative analyses (http://mixomics.org/).
7,727 downloads bioinformatics
Single-cell transcriptomics yields ever growing data sets containing RNA expression levels for thousands of genes from up to millions of cells. Common data analysis pipelines include a dimensionality reduction step for visualising the data in two dimensions, most frequently performed using t-distributed stochastic neighbour embedding (t-SNE). It excels at revealing local structure in high-dimensional data, but naive applications often suffer from severe shortcomings, e.g. the global structure of the data is not represented accurately. Here we describe how to circumvent such pitfalls, and develop a protocol for creating more faithful t-SNE visualisations. It includes PCA initialisation, a high learning rate, and multi-scale similarity kernels; for very large data sets, we additionally use exaggeration and downsampling-based initialisation. We use published single-cell RNA-seq data sets to demonstrate that this protocol yields superior results compared to the naive application of t-SNE.
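The PCA-initialisation step recommended above can be illustrated in a few lines: instead of seeding t-SNE with a random layout, the embedding starts from the leading principal components, which already preserve global structure. Below is a pure-Python sketch that recovers the first principal direction by power iteration on a toy matrix; the data and helper names are invented, and a real pipeline would hand this projection to a t-SNE implementation as its initial coordinates.

```python
# Sketch of PCA initialisation for t-SNE: compute the leading principal
# direction of a (centered) data matrix by power iteration, then project the
# points onto it to get initial embedding coordinates. Toy data only.
import random

def center(X):
    n = len(X)
    means = [sum(row[j] for row in X) / n for j in range(len(X[0]))]
    return [[x - m for x, m in zip(row, means)] for row in X]

def top_component(X, iters=200):
    """Leading principal direction via power iteration on X^T X."""
    d = len(X[0])
    random.seed(0)
    v = [random.random() for _ in range(d)]
    for _ in range(iters):
        Xv = [sum(x * vi for x, vi in zip(row, v)) for row in X]        # X v
        w = [sum(X[i][j] * Xv[i] for i in range(len(X))) for j in range(d)]
        norm = sum(wi * wi for wi in w) ** 0.5
        v = [wi / norm for wi in w]                                     # normalise
    return v

X = center([[2.0, 0.1], [4.0, -0.2], [6.0, 0.0], [8.0, 0.1]])
pc1 = top_component(X)
init = [sum(x * vi for x, vi in zip(row, pc1)) for row in X]  # 1-D t-SNE seed
print(pc1)
```

Since almost all the variance in this toy matrix lies along the first column, the recovered direction is close to the first axis, and the projected coordinates preserve the points' global ordering — exactly the property a random initialisation would throw away.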
7,685 downloads bioinformatics
Motivation: A large choice of tools exists for many standard tasks in the analysis of high-throughput sequencing (HTS) data. However, once a project deviates from standard workflows, custom scripts are needed. Results: We present HTSeq, a Python library to facilitate the rapid development of such scripts. HTSeq offers parsers for many common data formats in HTS projects, as well as classes to represent data such as genomic coordinates, sequences, sequencing reads, alignments, gene model information and variant calls, and provides data structures that allow for querying via genomic coordinates. We also present htseq-count, a tool developed with HTSeq that preprocesses RNA-Seq data for differential expression analysis by counting the overlap of reads with genes. Availability: HTSeq is released as open-source software under the GNU General Public Licence and available from http://www-huber.embl.de/HTSeq or from the Python Package Index, https://pypi.python.org/pypi/HTSeq
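The counting step htseq-count performs can be sketched in plain Python: assign each aligned read to a gene by interval overlap, and count it only when it hits exactly one gene. The coordinates and gene names below are made up, and the real tool additionally handles GTF parsing, strandedness and several overlap-resolution modes — this is just the core idea.

```python
# Toy version of read-to-gene counting à la htseq-count: a read is counted
# for a gene when it overlaps that gene's interval; reads overlapping more
# than one gene are set aside as ambiguous. Coordinates are half-open and
# invented for illustration.

genes = {"geneA": ("chr1", 100, 500), "geneB": ("chr1", 450, 900)}
reads = [("chr1", 120, 170), ("chr1", 460, 510), ("chr1", 600, 650)]

def overlaps(read, gene):
    r_chrom, r_start, r_end = read
    g_chrom, g_start, g_end = gene
    return r_chrom == g_chrom and r_start < g_end and g_start < r_end

counts = {name: 0 for name in genes}
ambiguous = 0
for read in reads:
    hits = [name for name, gene in genes.items() if overlaps(read, gene)]
    if len(hits) == 1:
        counts[hits[0]] += 1
    elif len(hits) > 1:
        ambiguous += 1
print(counts, ambiguous)  # {'geneA': 1, 'geneB': 1} 1
```

The second read falls in the region where geneA and geneB overlap, so it is discarded as ambiguous rather than double-counted — the same conservative choice the real counter makes in its default mode.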
7,624 downloads bioinformatics
We describe a novel method for the differential analysis of RNA-Seq data that utilizes bootstrapping in conjunction with response error linear modeling to decouple biological variance from inferential variance. The method is implemented in an interactive shiny app called sleuth that utilizes kallisto quantifications and bootstraps for fast and accurate analysis of RNA-Seq experiments.
7,608 downloads bioinformatics
Adam M Novak, Glenn Hickey, Erik Garrison, Sean Blum, Abram Connelly, Alexander Dilthey, Jordan Eizenga, M. A. Saleh Elmohamed, Sally Guthrie, André Kahles, Stephen Keenan, Jerome Kelleher, Deniz Kural, Heng Li, Michael F Lin, Karen Miga, Nancy Ouyang, Goran Rakocevic, Maciek Smuga-Otto, Alexander Wait Zaranek, Richard Durbin, Gil McVean, David Haussler, Benedict Paten
There is increasing recognition that a single, monoploid reference genome is a poor universal reference structure for human genetics, because it represents only a tiny fraction of human variation. Adding this missing variation results in a structure that can be described as a mathematical graph: a genome graph. We demonstrate that, in comparison to the existing reference genome (GRCh38), genome graphs can substantially improve the fractions of reads that map uniquely and perfectly. Furthermore, we show that this fundamental simplification of read mapping transforms the variant calling problem from one in which many non-reference variants must be discovered de novo to one in which the vast majority of variants are simply re-identified within the graph. Using standard benchmarks as well as a novel reference-free evaluation, we show that a simplistic variant calling procedure on a genome graph can already call variants at least as well as, and in many cases better than, a state-of-the-art method on the linear human reference genome. We anticipate that graph-based references will supplant linear references in humans and in other applications where cohorts of sequenced individuals are available.
7,560 downloads bioinformatics
Long-read single-molecule sequencing has revolutionized de novo genome assembly and enabled the automated reconstruction of reference-quality genomes. However, given the relatively high error rates of such technologies, efficient and accurate assembly of large repeats and closely related haplotypes remains challenging. We address these issues with Canu, a successor of Celera Assembler that is specifically designed for noisy single-molecule sequences. Canu introduces support for nanopore sequencing, halves depth-of-coverage requirements, and improves assembly continuity while simultaneously reducing runtime by an order of magnitude on large genomes versus Celera Assembler 8.2. These advances result from new overlapping and assembly algorithms, including an adaptive overlapping strategy based on tf-idf weighted MinHash and a sparse assembly graph construction that avoids collapsing diverged repeats and haplotypes. We demonstrate that Canu can reliably assemble complete microbial genomes and near-complete eukaryotic chromosomes using either PacBio or Oxford Nanopore technologies, and achieves a contig NG50 of greater than 21 Mbp on both human and Drosophila melanogaster PacBio datasets. For assembly structures that cannot be linearly represented, Canu provides graph-based assembly outputs in graphical fragment assembly (GFA) format for analysis or integration with complementary phasing and scaffolding techniques. The combination of such highly resolved assembly graphs with long-range scaffolding information promises the complete and automated assembly of complex genomes.
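The MinHash overlapping strategy mentioned above can be sketched compactly: each read is reduced to its smallest k-mer hashes, and the fraction of shared minimum hashes between two reads estimates their k-mer similarity, flagging candidate overlaps without all-vs-all alignment. The sequences below are toy data, and Canu's actual algorithm additionally weights k-mers by tf-idf; this sketch shows only the plain MinHash core.

```python
# Sketch of MinHash overlap detection: reads with shared sequence tend to
# share their smallest k-mer hashes, so comparing small sketches estimates
# similarity cheaply. Toy reads; no tf-idf weighting here.
import hashlib

def kmers(seq, k=5):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def sketch(seq, k=5, size=8):
    """Keep the `size` smallest k-mer hashes as the read's MinHash sketch."""
    hashes = sorted(
        int(hashlib.sha1(km.encode()).hexdigest(), 16) for km in kmers(seq, k)
    )
    return set(hashes[:size])

def similarity(a, b):
    """Estimated Jaccard similarity from the merged smallest hashes."""
    merged = sorted(a | b)[: max(len(a), len(b))]
    shared = len([h for h in merged if h in a and h in b])
    return shared / len(merged)

read1 = "ACGTACGTGGTACCAGTACGT"
read2 = "ACGTACGTGGTACCAGTTTTT"   # shares a long prefix with read1
read3 = "TTTTTCCCCCAAAAAGGGGGC"   # unrelated sequence
print(similarity(sketch(read1), sketch(read2)))
print(similarity(sketch(read1), sketch(read3)))
```

Because read1 and read2 share most of their k-mers while read1 and read3 share none, the first similarity is substantially higher than the second — which is all an overlapper needs to decide which read pairs deserve a full alignment.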
7,517 downloads bioinformatics
Transcriptome profiles of individual cells reflect true and often unexplored biological diversity, but are also affected by noise of biological and technical nature. This raises the need to explicitly model the resulting uncertainty and take it into account in any downstream analysis, such as dimensionality reduction, clustering, and differential expression. Here, we introduce Single-cell Variational Inference (scVI), a scalable framework for probabilistic representation and analysis of gene expression in single cells. Our model uses variational inference and stochastic optimization of deep neural networks to approximate the parameters that govern the distribution of expression values of each gene in every cell, using a non-linear mapping between the observations and a low-dimensional latent space. By doing so, scVI pools information between similar cells or genes while taking nuisance factors of variation such as batch effects and limited sensitivity into account. To evaluate scVI, we conducted a comprehensive comparative analysis to existing methods for distributional modeling and dimensionality reduction, all of which rely on generalized linear models. We first show that scVI scales to over one million cells, whereas competing algorithms can process at most tens of thousands of cells. Next, we show that scVI fits unseen data more closely and can impute missing data more accurately, both indicative of a better generalization capacity. We then utilize scVI to conduct a set of fundamental analysis tasks -- including batch correction, visualization, clustering and differential expression -- and demonstrate its accuracy in comparison to the state-of-the-art tools in each task. scVI is publicly available, and can be readily used as a principled and inclusive solution for multiple tasks of single-cell RNA sequencing data analysis.
7,101 downloads bioinformatics
Partial Least-Squares Discriminant Analysis (PLS-DA) is a popular machine learning tool that is gaining increasing attention as a useful feature selector and classifier. In an effort to understand its strengths and weaknesses, we performed a series of experiments with synthetic data and compared its performance to that of its close relative, Principal Component Analysis (PCA), from which it was originally derived. We demonstrate that even though PCA ignores the information regarding the class labels of the samples, this unsupervised tool can be remarkably effective as a dimensionality reducer and a feature selector. In some cases, it outperforms PLS-DA, which is made aware of the class labels in its input. Our experiments range from looking at the signal-to-noise ratio in the feature selection task, to considering many practical distributions and models for the synthetic data sets used. Our experiments consider many useful distributions encountered when analyzing bioinformatics and clinical data, especially in the context of machine learning, where it is hoped that the program automatically extracts and/or learns the hidden relationships.
7,091 downloads bioinformatics
Basecalling, the computational process of translating raw electrical signal to nucleotide sequence, is of critical importance to the sequencing platforms produced by Oxford Nanopore Technologies (ONT). Here we examine the performance of different basecalling tools, looking at accuracy at the level of bases within individual reads and at majority-rules consensus basecalls in an assembly. We also investigate some additional aspects of basecalling: training using a taxon-specific dataset, using a larger neural network model and improving consensus basecalls in an assembly by additional signal-level analysis with Nanopolish. Training basecallers on taxon-specific data results in a significant boost in consensus accuracy, mostly due to the reduction of errors in methylation motifs. A larger neural network is able to improve both read and consensus accuracy, but at a cost to speed. Improving consensus sequences ('polishing') with Nanopolish somewhat negates the accuracy differences between basecallers, but pre-polish accuracy does have an effect on post-polish accuracy. Basecalling accuracy has seen significant improvements over the last two years. The current version of ONT's Guppy basecaller performs well overall, with good accuracy and fast performance. If higher accuracy is required, users should consider producing a custom model using a larger neural network and/or training data from the same species.
7,044 downloads bioinformatics
Pathway enrichment analysis helps researchers gain mechanistic insight into large gene lists typically resulting from genome-scale (-omics) experiments. It identifies biological pathways that are enriched in the gene list more than expected by chance. We explain pathway enrichment analysis and present a practical step-by-step guide to help interpret gene lists resulting from RNA-seq and genome sequencing experiments. The protocol comprises three major steps: define a gene list from genome-scale data, determine statistically enriched pathways, and visualize and interpret the results. We focus on differentially expressed genes and mutated cancer genes; however, the described principles can be applied to diverse -omics data. The protocol is designed for biologists with no prior bioinformatics training and uses freely available software including g:Profiler, GSEA, Cytoscape and EnrichmentMap.
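The "more than expected by chance" test at the heart of this protocol is typically a hypergeometric (one-sided Fisher) test, which can be written in a few lines of standard Python. The gene counts below are invented for illustration, and tools like g:Profiler additionally apply multiple-testing correction across all pathways tested.

```python
# Hypergeometric test for pathway enrichment: given N annotated genes of
# which K belong to a pathway, what is the probability that a random list of
# n genes contains k or more pathway members? Counts are illustrative only.
from math import comb

def hypergeom_pvalue(N, K, n, k):
    """P(X >= k) when drawing n genes from N, of which K are in the pathway."""
    return sum(
        comb(K, i) * comb(N - K, n - i) for i in range(k, min(K, n) + 1)
    ) / comb(N, n)

# 20,000 genes in the genome, 50 in the pathway; our input list has 200
# genes, 8 of which fall in the pathway (expected overlap is only 0.5).
p = hypergeom_pvalue(N=20_000, K=50, n=200, k=8)
print(p)
```

Seeing 8 pathway genes where chance predicts about 0.5 yields a very small p-value, so this pathway would be reported as enriched; in a real analysis this p-value would then be adjusted for the thousands of pathways tested simultaneously.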
6,899 downloads bioinformatics
Jie Zheng, A Mesut Erzurumluoglu, Benjamin L Elsworth, Laurence Howe, Philip C. Haycock, Gibran Hemani, Katherine Tansey, Charles Laurin, Early Genetics and Lifecourse Epidemiology (EAGLE) Eczema Consortium, Beate St Pourcain, Nicole M. Warrington, Hilary K Finucane, Alkes L. Price, Brendan K Bulik-Sullivan, Verneri Anttila, Lavinia Paternoster, Tom R Gaunt, David M Evans, Benjamin M Neale
Motivation: LD score regression is a reliable and efficient method of using genome-wide association study (GWAS) summary-level results data to estimate the SNP heritability of complex traits and diseases, partition this heritability into functional categories, and estimate the genetic correlation between different phenotypes. Because the method relies on summary level results data, LD score regression is computationally tractable even for very large sample sizes. However, publicly available GWAS summary-level data are typically stored in different databases and have different formats, making it difficult to apply LD score regression to estimate genetic correlations across many different traits simultaneously. Results: In this manuscript, we describe LD Hub - a centralized database of summary-level GWAS results for 177 diseases/traits from different publicly available resources/consortia and a web interface that automates the LD score regression analysis pipeline. To demonstrate functionality and validate our software, we replicated previously reported LD score regression analyses of 49 traits/diseases using LD Hub; and estimated SNP heritability and the genetic correlation across the different phenotypes. We also present new results obtained by uploading a recent atopic dermatitis GWAS meta-analysis to examine the genetic correlation between the condition and other potentially related traits. In response to the growing availability of publicly accessible GWAS summary-level results data, our database and the accompanying web interface will ensure maximal uptake of the LD score regression methodology, provide a useful database for the public dissemination of GWAS results, and provide a method for easily screening hundreds of traits for overlapping genetic aetiologies. Availability and implementation: The web interface and instructions for using LD Hub are available at http://ldsc.broadinstitute.org/
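The regression LD Hub automates rests on a simple linear relationship: under a polygenic model, a SNP's expected GWAS chi-square statistic grows with its LD score, E[chi2_j] = (N * h2 / M) * l_j + 1, so regressing chi-square statistics on LD scores and rescaling the slope estimates SNP heritability h2. The sketch below simulates data under that model and recovers h2 by ordinary least squares; all numbers are simulated, not real GWAS results, and the real method uses a weighted regression with jackknife standard errors.

```python
# Sketch of LD score regression on simulated summary statistics:
# chi2 ~ (N * h2 / M) * ld_score + 1, so slope * M / N estimates h2.
# N = sample size, M = number of SNPs; all values are simulated.
import random

random.seed(1)
N, M, h2_true = 50_000, 1_000_000, 0.4
ld_scores = [random.uniform(1, 200) for _ in range(5_000)]
chi2 = [(N * h2_true / M) * l + 1 + random.gauss(0, 0.5) for l in ld_scores]

# Ordinary least squares slope of chi2 on LD score.
mean_l = sum(ld_scores) / len(ld_scores)
mean_c = sum(chi2) / len(chi2)
slope = sum((l - mean_l) * (c - mean_c) for l, c in zip(ld_scores, chi2)) \
    / sum((l - mean_l) ** 2 for l in ld_scores)

h2_est = slope * M / N  # rescale slope back to heritability
print(round(h2_est, 2))
```

Because the method needs only per-SNP chi-square statistics and precomputed LD scores — not individual-level genotypes — it stays tractable at any sample size, which is what makes a centralized database of summary statistics like LD Hub feasible in the first place.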
6,841 downloads bioinformatics
Konstantinos Vougas, Magdalena Krochmal, Thomas Jackson, Alexander Polyzos, Archimides Aggelopoulos, Ioannis S Pateras, Michael Liontos, Anastasia Varvarigou, Elizabeth O Johnson, Vassilis Georgoulias, Antonia Vlahou, Paul Townsend, Dimitris Thanos, Jiri Bartek, Vassilis G Gorgoulis
A major challenge in cancer treatment is predicting the clinical response to anti-cancer drugs for each individual patient. For complex diseases such as cancer, characterized by high inter-patient variance, the implementation of precision medicine approaches is dependent upon understanding the pathological processes at the molecular level. While the 'omics' era provides unique opportunities to dissect the molecular features of diseases, the ability to utilize it in targeted therapeutic efforts is hindered by both the massive size and diverse nature of the 'omics' data. Recent advances with Deep Learning Neural Networks (DLNNs) suggest that DLNNs could be trained on large data sets to efficiently predict therapeutic responses in cancer treatment. We present the application of Association Rule Mining combined with DLNNs for the analysis of high-throughput molecular profiles of 1,001 cancer cell lines, in order to extract cancer-specific signatures in the form of easily interpretable rules and use these rules as input to predict pharmacological responses to a large number of anti-cancer drugs. The proposed algorithm outperformed Random Forests (RF) and Bayesian Multitask Multiple Kernel Learning (BMMKL) classification, which currently represent the state-of-the-art in drug-response prediction. Moreover, the in silico pipeline presented introduces a novel strategy for identifying potential therapeutic targets, as well as possible drug combinations with high therapeutic potential. For the first time, we demonstrate that DLNNs trained on a large pharmacogenomics data set can effectively predict the therapeutic response of specific drugs in different cancer types. These findings serve as a proof of concept for the application of DLNNs to predict therapeutic responsiveness, a milestone in precision medicine.
6,800 downloads bioinformatics
Advances in nanopore sequencing technology have enabled investigation of the full catalogue of covalent DNA modifications. We present the first algorithm for the identification of modified nucleotides without the need for prior training data, along with an open-source software implementation, nanoraw. Nanoraw accurately assigns contiguous raw nanopore signal to genomic positions, enabling novel data visualization and increasing power and accuracy for the discovery of covalently modified bases in native DNA. Ground truth case studies utilizing synthetically methylated DNA show the capacity to identify three distinct methylation marks, 4mC, 5mC, and 6mA, in seven distinct sequence contexts without any changes to the algorithm. We demonstrate quantitative reproducibility, simultaneously identifying 5mC and 6mA in native E. coli across biological replicates processed in different labs. Finally, we propose a pipeline for the comprehensive discovery of DNA modifications in any genome without a priori knowledge of their chemical identities.
6,695 downloads bioinformatics
Software to call single nucleotide polymorphisms or related genetic variants has converged on the variant call format (VCF) as the output format of choice. This has created a need for tools to work with VCF files. While an increasing amount of software exists to read VCF data, many tools only extract the genotypes without including the data associated with each genotype that describes its quality. We created the R package vcfR to address this issue. We developed a VCF file exploration tool implemented in the R language because R provides an interactive experience and an environment that is commonly used for genetic data analysis. Functions to read and write VCF files into R, as well as functions to extract portions of the data and to plot summary statistics of the data, are implemented. vcfR further provides the ability to visualize how various parameterizations of the data affect the results. Additional tools are included to integrate sequence (FASTA) and annotation data (GFF) for visualization of genomic regions such as chromosomes. Conversion functions translate data from the vcfR data structure to formats used by other R genetics packages. Computationally intensive functions are implemented in C++ to improve performance. Use of these tools is intended to facilitate VCF data exploration, including intuitive methods for data quality control and easy export to other R packages for further analysis. vcfR thus provides essential, novel tools currently not available in R.
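The gap vcfR fills — keeping the per-genotype quality fields, not just the genotype call — is easy to see in the VCF format itself: the FORMAT column names the fields, and each sample column carries the corresponding colon-separated values. The sketch below (in Python rather than R, as a format illustration only) parses one fabricated record; the positions and values are invented.

```python
# One fabricated VCF data line: FORMAT (column 9) declares GT:DP:GQ, and each
# sample column pairs a genotype (GT) with its depth (DP) and quality (GQ).
line = "chr1\t1013\t.\tA\tG\t50\tPASS\t.\tGT:DP:GQ\t0/1:35:99\t1/1:12:45"

fields = line.split("\t")
format_keys = fields[8].split(":")          # ['GT', 'DP', 'GQ']
samples = [
    dict(zip(format_keys, sample.split(":"))) for sample in fields[9:]
]
print(samples[0])  # {'GT': '0/1', 'DP': '35', 'GQ': '99'}
```

A genotype-only extractor would keep just '0/1' and '1/1' and discard DP and GQ — exactly the quality information vcfR retains so that calls backed by 12 reads can be treated differently from calls backed by 35.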
- 21 May 2019: PLOS Biology has published a community page about Rxivist.org and its design.
- 10 May 2019: The paper analyzing the Rxivist dataset has been published at eLife.
- 1 Mar 2019: We now have summary statistics about bioRxiv downloads and submissions.
- 8 Feb 2019: Data from Altmetric is now available on the Rxivist details page for every preprint. Look for the "donut" under the download metrics.
- 30 Jan 2019: preLights has featured the Rxivist preprint and written about our findings.
- 22 Jan 2019: Nature just published an article about Rxivist and our data.
- 13 Jan 2019: The Rxivist preprint is live!