
Rxivist combines preprints from bioRxiv with data from Twitter to help you find the papers being discussed in your field. Currently indexing 72,919 bioRxiv papers from 317,430 authors.

Most downloaded bioRxiv papers, all time, in category bioinformatics

6,994 results found.

21: A complete bacterial genome assembled de novo using only nanopore sequencing data

Posted to bioRxiv 20 Feb 2015

10,327 downloads bioinformatics

Nicholas J. Loman, Joshua Quick, Jared T Simpson

A method for de novo assembly of data from the Oxford Nanopore MinION instrument is presented which is able to reconstruct the sequence of an entire bacterial chromosome in a single contig. Initially, overlaps between nanopore reads are detected. Reads are then subjected to one or more rounds of error correction by a multiple alignment process employing partial order graphs. After correction, reads are assembled using the Celera assembler. Finally, the assembly is polished using signal-level data from the nanopore employing a novel hidden Markov model. We show that this method is able to assemble nanopore reads from Escherichia coli K-12 MG1655 into a single contig of length 4.6Mb permitting a full reconstruction of gene order. The resulting draft assembly has 98.4% nucleotide identity compared to the finished reference genome. After polishing the assembly with our signal-level HMM, the nucleotide identity is improved to 99.4%. We show that MinION sequencing data can be used to reconstruct genomes without the need for a reference sequence or data from other sequencing platforms.
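The first step of the pipeline described above, detecting overlaps between reads, can be illustrated with a minimal suffix-prefix overlap search in pure Python. This is a toy sketch of the concept only, not the authors' tool, which must tolerate high error rates and uses far more efficient indexing:

```python
def overlap(a, b, min_length=3):
    """Length of the longest suffix of `a` that matches a prefix of `b`
    (at least `min_length` bases long), or 0 if none exists."""
    start = 0
    while True:
        # Find the next occurrence of b's first min_length bases in a.
        start = a.find(b[:min_length], start)
        if start == -1:
            return 0
        # Check whether the remaining suffix of a matches b's prefix.
        if b.startswith(a[start:]):
            return len(a) - start
        start += 1

print(overlap("TTACGT", "CGTACCGT"))  # 3: the suffix "CGT" matches b's prefix
```

Real long-read overlappers avoid this quadratic exact matching entirely, since exact seeds are rare in reads with 5-40% error.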

22: Exploring the phenotypic consequences of tissue specific gene expression variation inferred from GWAS summary statistics

Posted to bioRxiv 23 Mar 2016

10,004 downloads bioinformatics

Alvaro N. Barbeira, Scott P. Dickinson, Jason M. Torres, Jiamao Zheng, Eric S. Torstenson, Heather E. Wheeler, Kaanan P. Shah, Rodrigo Bonazzola, Tzintzuni Garcia, Todd Edwards, GTEx Consortium, Dan L Nicolae, Nancy J. Cox, Hae Kyung Im

Scalable, integrative methods to understand mechanisms that link genetic variants with phenotypes are needed. Here we derive a mathematical expression to compute PrediXcan (a gene mapping approach) results using summary data (S-PrediXcan) and show its accuracy and general robustness to misspecified reference sets. We apply this framework to 44 GTEx tissues and 100+ phenotypes from GWAS and meta-analysis studies, creating a growing public catalog of associations that seeks to capture the effects of gene expression variation on human phenotypes. Replication in an independent cohort is shown. Most of the associations were tissue specific, suggesting context specificity of the trait etiology. Colocalized significant associations in unexpected tissues underscore the need for an agnostic scanning of multiple contexts to improve our ability to detect causal regulatory mechanisms. Monogenic disease genes are enriched among significant associations for related traits, suggesting that smaller alterations of these genes may cause a spectrum of milder phenotypes.

23: Oxford Nanopore Sequencing, Hybrid Error Correction, and de novo Assembly of a Eukaryotic Genome

Posted to bioRxiv 06 Jan 2015

9,818 downloads bioinformatics

Sara Goodwin, James Gurtowski, Scott Ethe-Sayers, Panchajanya Deshpande, Michael C. Schatz, W. Richard McCombie

Monitoring the progress of DNA molecules through a membrane pore has been postulated as a method for sequencing DNA for several decades. Recently, a nanopore-based sequencing instrument, the Oxford Nanopore MinION, has become available that we used for sequencing the S. cerevisiae genome. To make use of these data, we developed a novel open-source hybrid error correction algorithm Nanocorr (https://github.com/jgurtowski/nanocorr) specifically for Oxford Nanopore reads, as existing packages were incapable of assembling the long read lengths (5-50 kbp) at such a high error rate (between ~5% and 40%). With this new method we were able to perform a hybrid error correction of the nanopore reads using complementary MiSeq data and produce a de novo assembly that is highly contiguous and accurate: the contig N50 length is more than ten times greater than that of an Illumina-only assembly (678 kbp versus 59.9 kbp), with greater than 99.88% consensus identity when compared to the reference. Furthermore, the assembly with the long nanopore reads presents a much more complete representation of the features of the genome and correctly assembles gene cassettes, rRNAs, transposable elements, and other genomic features that were almost entirely absent in the Illumina-only assembly.

24: The art of using t-SNE for single-cell transcriptomics

Posted to bioRxiv 25 Oct 2018

9,293 downloads bioinformatics

Dmitry Kobak, Philipp Berens

Single-cell transcriptomics yields ever growing data sets containing RNA expression levels for thousands of genes from up to millions of cells. Common data analysis pipelines include a dimensionality reduction step for visualising the data in two dimensions, most frequently performed using t-distributed stochastic neighbour embedding (t-SNE). It excels at revealing local structure in high-dimensional data, but naive applications often suffer from severe shortcomings, e.g. the global structure of the data is not represented accurately. Here we describe how to circumvent such pitfalls, and develop a protocol for creating more faithful t-SNE visualisations. It includes PCA initialisation, a high learning rate, and multi-scale similarity kernels; for very large data sets, we additionally use exaggeration and downsampling-based initialisation. We use published single-cell RNA-seq data sets to demonstrate that this protocol yields superior results compared to the naive application of t-SNE.
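Two key ingredients of the protocol, PCA initialisation and a high, sample-size-dependent learning rate, can be sketched with scikit-learn's t-SNE. This is a minimal illustration on a random stand-in for an expression matrix, not the authors' own code, which used other t-SNE implementations:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 50))  # toy stand-in for a cells-by-genes matrix

# PCA initialisation anchors the embedding to the data's global structure;
# the protocol scales it to a small standard deviation before optimisation.
init = PCA(n_components=2, random_state=0).fit_transform(X)
init = init / np.std(init[:, 0]) * 0.0001

emb = TSNE(
    n_components=2,
    init=init,
    learning_rate=X.shape[0] / 12,  # high learning rate scaled to sample size
    random_state=0,
).fit_transform(X)
print(emb.shape)  # (300, 2)
```

For very large data sets the protocol adds exaggeration and downsampling-based initialisation, which plain scikit-learn does not expose directly.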

25: Visualization and analysis of single-cell RNA-seq data by kernel-based similarity learning

Posted to bioRxiv 09 May 2016

9,250 downloads bioinformatics

Bo Wang, Junjie Zhu, Emma Pierson, Daniele Ramazzotti, Serafim Batzoglou

Single-cell RNA-seq technologies enable high throughput gene expression measurement of individual cells, and allow the discovery of heterogeneity within cell populations. Measurement of cell-to-cell gene expression similarity is critical to identification, visualization and analysis of cell populations. However, single-cell data introduce challenges to conventional measures of gene expression similarity because of the high level of noise, outliers and dropouts. Here, we propose a novel similarity-learning framework, SIMLR (single-cell interpretation via multi-kernel learning), which learns an appropriate distance metric from the data for dimension reduction, clustering and visualization applications. Benchmarking against state-of-the-art methods for these applications, we used SIMLR to re-analyse seven representative single-cell data sets, including high-throughput droplet-based data sets with tens of thousands of cells. We show that SIMLR greatly improves clustering sensitivity and accuracy, as well as the visualization and interpretability of the data.
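The multi-kernel idea can be sketched in a few lines of NumPy: build a bank of Gaussian similarity kernels at different bandwidths and combine them. SIMLR learns the combination weights and the metric jointly from the data; the uniform weights and median-distance scaling below are illustrative stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))  # toy cells-by-genes matrix

# Pairwise squared Euclidean distances between cells.
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)

# Gaussian kernels at several bandwidths, scaled by the median distance.
sigmas = [0.5, 1.0, 2.0]
scale = np.median(d2)
kernels = [np.exp(-d2 / (2 * s * scale)) for s in sigmas]

S = sum(kernels) / len(kernels)  # combined cell-to-cell similarity matrix
print(S.shape)  # (50, 50)
```

A multi-kernel combination like this is less sensitive to a single bandwidth choice, which matters for noisy, dropout-heavy single-cell data.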

26: Human microbiome aging clocks based on deep learning and tandem of permutation feature importance and accumulated local effects

Posted to bioRxiv 28 Dec 2018

9,112 downloads bioinformatics

Fedor Galkin, Aleksandr Aliper, Evgeny Putin, Igor Kuznetsov, Vadim N. Gladyshev, Alex Zhavoronkov

The human gut microbiome is a complex ecosystem that both affects and is affected by its host status. Previous analyses of gut microflora revealed associations between specific microbes and host health and disease status, genotype and diet. Here, we developed a method of predicting the biological age of the host based on the microbiological profiles of gut microbiota using a curated dataset of 1,165 healthy individuals (1,663 microbiome samples). Our predictive model, a human microbiome clock, is a deep neural network that achieves an accuracy of 3.94 years mean absolute error in cross-validation. The performance of the deep microbiome clock was also evaluated on several additional populations. We further introduce a platform for biological interpretation of individual microbial features used in age models, which relies on permutation feature importance and accumulated local effects. This approach has allowed us to define two lists of 95 intestinal biomarkers of human aging. We further show that this list can be reduced to 39 taxa that convey the most information on their host's aging. Overall, we show that (a) microbiological profiles can be used to predict human age; and (b) microbial features selected by models are age-related.

27: Interaction of N-3-oxododecanoyl homoserine lactone with transcriptional regulator LasR of Pseudomonas aeruginosa: Insights from molecular docking and dynamics simulations

Posted to bioRxiv 28 Mar 2017

9,006 downloads bioinformatics

Hovakim Grabski, Lernik Hunanyan, Susanna Tiratsuyan, Hrachik Vardapetyan

In 2017, the World Health Organization announced the list of the most dangerous superbugs, among them Pseudomonas aeruginosa, an antibiotic-resistant opportunistic human pathogen and one of the "ESKAPE" pathogens. The central problem is that it affects patients suffering from AIDS, cystic fibrosis, cancer, burns, etc. P. aeruginosa creates and inhabits surface-associated biofilms. Biofilms increase resistance to antibiotics and host immune responses; because of this, current treatments are not effective. It is imperative to find new antibacterial treatment strategies against P. aeruginosa, but detailed molecular properties of the LasR protein are not clearly known to date. In the present study, we analysed the molecular properties of the LasR protein as well as the mode of its interactions with the autoinducer (AI) N-3-oxododecanoyl homoserine lactone (3-O-C12-HSL). We performed docking and molecular dynamics (MD) simulations of the LasR protein of P. aeruginosa with the 3-O-C12-HSL ligand. We assessed the conformational changes of the interaction and analysed the molecular details of the binding of 3-O-C12-HSL with LasR. A new interaction site of 3-O-C12-HSL with the LasR protein was found, which involves interaction with conserved residues from the ligand-binding domain (LBD), beta turns in the short linker region (SLR), and the DNA-binding domain (DBD). We refer to it as the LBD-SLR-DBD bridge interaction, or "the bridge". We also performed LasR monomer protein docking and found a new form of dimerization. This study may offer new insights for future experimental studies to detect the interaction of the autoinducer with "the bridge" of the LasR protein, and a new interaction site for drug design.

28: DIABLO: from multi-omics assays to biomarker discovery, an integrative approach

Posted to bioRxiv 03 Aug 2016

8,656 downloads bioinformatics

Amrit Singh, Casey P. Shannon, Benoît Gautier, Florian Rohart, Michaël Vacher, Scott J. Tebbutt, Kim-Anh Lê Cao

Systems biology approaches, leveraging multi-omics measurements, are needed to capture the complexity of biological networks while identifying the key molecular drivers of disease mechanisms. We present DIABLO, a novel integrative method to identify multi-omics biomarker panels that can discriminate between multiple phenotypic groups. In the multi-omics analyses of simulated and real-world datasets, DIABLO resulted in superior biological enrichment compared to other integrative methods, and achieved comparable predictive performance with existing multi-step classification schemes. DIABLO is a versatile approach that will benefit a diverse range of research areas, where multiple high dimensional datasets are available for the same set of specimens. DIABLO is implemented along with tools for model selection and validation, as well as graphical outputs to assist in the interpretation of these integrative analyses (http://mixomics.org/).

29: Beyond library size: a field guide to NGS normalization

Posted to bioRxiv 19 Jun 2014

8,561 downloads bioinformatics

Jelena Aleksic, Sarah H. Carl, Michaela Frye

Background: Next generation sequencing (NGS) is a widely used technology in both basic research and clinical settings and it will continue to have a major impact on biomedical sciences. However, the use of incorrect normalization methods can lead to systematic biases and spurious results, making the selection of an appropriate normalization strategy a crucial and often overlooked part of NGS analysis. Results: We present a basic introduction to the currently available normalization methods for differential expression and ChIP-seq applications, along with best use recommendations for different experimental techniques and datasets. We demonstrate that the choice of normalization technique can have a significant impact on the number of genes called as differentially expressed in an RNA-seq experiment or peaks called in a ChIP-seq experiment. Conclusions: The choice of the most adequate normalization method depends on both the distribution of signal in the dataset and the intended downstream applications. Depending on the design and purpose of the study, appropriate bias correction should also be considered.
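Two of the strategies such a guide covers can be sketched in NumPy: simple library-size scaling (counts per million) and a DESeq-style median-of-ratios size factor. This is a toy sketch assuming a genes-by-samples count matrix with no zero counts:

```python
import numpy as np

counts = np.array([[10.0, 20.0],
                   [100.0, 200.0],
                   [30.0, 60.0]])  # genes x samples; sample 2 has 2x the depth

# Library-size normalization: counts per million mapped reads.
cpm = counts / counts.sum(axis=0) * 1e6

# Median-of-ratios: each gene's ratio to its geometric mean across samples,
# then the per-sample median of those ratios becomes the size factor.
log_geo_mean = np.log(counts).mean(axis=1)
size_factors = np.exp(np.median(np.log(counts) - log_geo_mean[:, None], axis=0))
print(size_factors)  # ~[0.707, 1.414]: sample 2 is scaled down twice as hard
```

Median-of-ratios is robust to a handful of highly expressed genes dominating the library size, which is exactly the failure mode of naive total-count scaling.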

30: So you think you can PLS-DA?

Posted to bioRxiv 21 Oct 2017

8,150 downloads bioinformatics

Daniel Ruiz-Perez, Giri Narasimhan

Partial Least-Squares Discriminant Analysis (PLS-DA) is a popular machine learning tool that is gaining increasing attention as a useful feature selector and classifier. In an effort to understand its strengths and weaknesses, we performed a series of experiments with synthetic data and compared its performance to that of its close relative, Principal Component Analysis (PCA), from which it was originally derived. We demonstrate that even though PCA ignores the information regarding the class labels of the samples, this unsupervised tool can be remarkably effective as a dimensionality reducer and a feature selector. In some cases, it outperforms PLS-DA, which is made aware of the class labels in its input. Our experiments range from looking at the signal-to-noise ratio in the feature selection task, to considering many practical distributions and models for the synthetic data sets used. Our experiments consider many useful distributions encountered when analyzing bioinformatics and clinical data, especially in the context of machine learning, where it is hoped that the program automatically extracts and/or learns the hidden relationships.

31: Pathway enrichment analysis of -omics data

Posted to bioRxiv 12 Dec 2017

8,018 downloads bioinformatics

Jüri Reimand, Ruth Isserlin, Veronique Voisin, Mike Kucera, Christian Tannus-Lopes, Asha Rostamianfar, Lina Wadi, Mona Meyer, Jeff Wong, Changjiang Xu, Daniele Merico, Gary Bader

Pathway enrichment analysis helps gain mechanistic insight into large gene lists typically resulting from genome scale (-omics) experiments. It identifies biological pathways that are enriched in the gene list more than expected by chance. We explain pathway enrichment analysis and present a practical step-by-step guide to help interpret gene lists resulting from RNA-seq and genome sequencing experiments. The protocol comprises three major steps: define a gene list from genome scale data, determine statistically enriched pathways, and visualize and interpret the results. We focus on differentially expressed genes and mutated cancer genes; however, the described principles can be applied to diverse -omics data. The protocol is designed for biologists with no prior bioinformatics training and uses freely available software including g:Profiler, GSEA, Cytoscape and Enrichment Map.
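At its core, the "enriched more than expected by chance" test for a single pathway is a hypergeometric (Fisher's exact) tail probability, which can be written directly with SciPy. The numbers below are illustrative, not taken from the protocol:

```python
from scipy.stats import hypergeom

N = 20000  # background genes
K = 200    # genes annotated to the pathway
n = 500    # genes in our experimental list
k = 20     # list genes that fall in the pathway (expected ~5 by chance)

# P(X >= k) under the hypergeometric null: the enrichment p-value.
p_value = hypergeom.sf(k - 1, N, K, n)
print(p_value < 0.001)  # True: strong over-representation
```

In a real analysis this test is repeated across thousands of pathways, so a multiple-testing correction (e.g. Benjamini-Hochberg FDR) is applied to the resulting p-values.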

32: HTSeq - A Python framework to work with high-throughput sequencing data

Posted to bioRxiv 20 Feb 2014

7,959 downloads bioinformatics

Simon Anders, Paul Theodor Pyl, Wolfgang Huber

Motivation: A large choice of tools exists for many standard tasks in the analysis of high-throughput sequencing (HTS) data. However, once a project deviates from standard work flows, custom scripts are needed. Results: We present HTSeq, a Python library to facilitate the rapid development of such scripts. HTSeq offers parsers for many common data formats in HTS projects, as well as classes to represent data such as genomic coordinates, sequences, sequencing reads, alignments, gene model information, variant calls, and provides data structures that allow for querying via genomic coordinates. We also present htseq-count, a tool developed with HTSeq that preprocesses RNA-Seq data for differential expression analysis by counting the overlap of reads with genes. Availability: HTSeq is released as open-source software under the GNU General Public Licence and available from http://www-huber.embl.de/HTSeq or from the Python Package Index, https://pypi.python.org/pypi/HTSeq
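The idea behind counting read-gene overlaps, as htseq-count does in its default "union" mode, is that a read is counted for a gene only when every feature it overlaps belongs to that one gene. A plain-Python sketch of that rule (toy coordinates, not HTSeq's actual API, which works on BAM/GTF input and handles strandedness):

```python
from collections import Counter

# Toy gene annotation and aligned reads: (chromosome, start, end).
genes = {"geneA": ("chr1", 100, 500), "geneB": ("chr1", 450, 900)}
reads = [("chr1", 120, 220), ("chr1", 460, 480), ("chr1", 600, 700)]

counts = Counter()
for chrom, start, end in reads:
    # All genes this read overlaps (half-open interval intersection).
    hits = {g for g, (c, s, e) in genes.items()
            if c == chrom and start < e and s < end}
    if len(hits) == 1:           # unambiguous: count it for that gene
        counts[hits.pop()] += 1  # ambiguous or no overlap: discard the read
print(counts)
```

Here the second read overlaps both genes, so it is discarded rather than double-counted, keeping counts suitable for differential expression analysis.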

33: Genome Graphs

Posted to bioRxiv 18 Jan 2017

7,944 downloads bioinformatics

Adam M. Novak, Glenn Hickey, Erik Garrison, Sean Blum, Abram Connelly, Alexander Dilthey, Jordan Eizenga, M. A. Saleh Elmohamed, Sally Guthrie, André Kahles, Stephen Keenan, Jerome Kelleher, Deniz Kural, Heng Li, Michael F Lin, Karen Miga, Nancy Ouyang, Goran Rakocevic, Maciek Smuga-Otto, Alexander Wait Zaranek, Richard Durbin, Gil McVean, David Haussler, Benedict Paten

There is increasing recognition that a single, monoploid reference genome is a poor universal reference structure for human genetics, because it represents only a tiny fraction of human variation. Adding this missing variation results in a structure that can be described as a mathematical graph: a genome graph. We demonstrate that, in comparison to the existing reference genome (GRCh38), genome graphs can substantially improve the fractions of reads that map uniquely and perfectly. Furthermore, we show that this fundamental simplification of read mapping transforms the variant calling problem from one in which many non-reference variants must be discovered de-novo to one in which the vast majority of variants are simply re-identified within the graph. Using standard benchmarks as well as a novel reference-free evaluation, we show that a simplistic variant calling procedure on a genome graph can already call variants at least as well as, and in many cases better than, a state-of-the-art method on the linear human reference genome. We anticipate that graph-based references will supplant linear references in humans and in other applications where cohorts of sequenced individuals are available.
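A genome graph can be illustrated in a few lines: sequence-labelled nodes plus edges, with a known SNP stored as a two-node "bubble", so both alleles are already paths in the reference rather than variants to be discovered de novo. This is a toy sketch; production tools use far richer, indexed structures:

```python
# Each node carries sequence; edges encode which nodes can follow which.
# Nodes 1 and 2 form a SNP bubble: reference allele "A", alternate "G".
nodes = {0: "ACGT", 1: "A", 2: "G", 3: "TTCA"}
edges = {0: [1, 2], 1: [3], 2: [3], 3: []}

def paths(node, prefix=""):
    """Yield the sequence spelled by every path from `node` to a sink."""
    prefix += nodes[node]
    if not edges[node]:
        yield prefix
    for nxt in edges[node]:
        yield from paths(nxt, prefix)

print(sorted(paths(0)))  # ['ACGTATTCA', 'ACGTGTTCA']
```

A read carrying either allele maps perfectly to one of the two paths, which is the intuition behind the improved unique and perfect mapping rates reported in the abstract.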

34: Differential analysis of RNA-Seq incorporating quantification uncertainty

Posted to bioRxiv 10 Jun 2016

7,911 downloads bioinformatics

Harold Pimentel, Nicolas L. Bray, Suzette Puente, Páll Melsted, Lior Pachter

We describe a novel method for the differential analysis of RNA-Seq data that utilizes bootstrapping in conjunction with response error linear modeling to decouple biological variance from inferential variance. The method is implemented in an interactive shiny app called sleuth that utilizes kallisto quantifications and bootstraps for fast and accurate analysis of RNA-Seq experiments.

35: Bayesian Inference for a Generative Model of Transcriptome Profiles from Single-cell RNA Sequencing

Posted to bioRxiv 30 Mar 2018

7,895 downloads bioinformatics

Romain Lopez, Jeffrey Regier, Michael Cole, Michael Jordan, Nir Yosef

Transcriptome profiles of individual cells reflect true and often unexplored biological diversity, but are also affected by noise of biological and technical nature. This raises the need to explicitly model the resulting uncertainty and take it into account in any downstream analysis, such as dimensionality reduction, clustering, and differential expression. Here, we introduce Single-cell Variational Inference (scVI), a scalable framework for probabilistic representation and analysis of gene expression in single cells. Our model uses variational inference and stochastic optimization of deep neural networks to approximate the parameters that govern the distribution of expression values of each gene in every cell, using a non-linear mapping between the observations and a low-dimensional latent space. By doing so, scVI pools information between similar cells or genes while taking nuisance factors of variation such as batch effects and limited sensitivity into account. To evaluate scVI, we conducted a comprehensive comparative analysis to existing methods for distributional modeling and dimensionality reduction, all of which rely on generalized linear models. We first show that scVI scales to over one million cells, whereas competing algorithms can process at most tens of thousands of cells. Next, we show that scVI fits unseen data more closely and can impute missing data more accurately, both indicative of a better generalization capacity. We then utilize scVI to conduct a set of fundamental analysis tasks -- including batch correction, visualization, clustering and differential expression -- and demonstrate its accuracy in comparison to the state-of-the-art tools in each task. scVI is publicly available, and can be readily used as a principled and inclusive solution for multiple tasks of single-cell RNA sequencing data analysis.

36: Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation

Posted to bioRxiv 24 Aug 2016

7,815 downloads bioinformatics

Sergey Koren, Brian P. Walenz, Konstantin Berlin, Jason R Miller, Nicholas H. Bergman, Adam M. Phillippy

Long-read single-molecule sequencing has revolutionized de novo genome assembly and enabled the automated reconstruction of reference-quality genomes. However, given the relatively high error rates of such technologies, efficient and accurate assembly of large repeats and closely related haplotypes remains challenging. We address these issues with Canu, a successor of Celera Assembler that is specifically designed for noisy single-molecule sequences. Canu introduces support for nanopore sequencing, halves depth-of-coverage requirements, and improves assembly continuity while simultaneously reducing runtime by an order of magnitude on large genomes versus Celera Assembler 8.2. These advances result from new overlapping and assembly algorithms, including an adaptive overlapping strategy based on tf-idf weighted MinHash and a sparse assembly graph construction that avoids collapsing diverged repeats and haplotypes. We demonstrate that Canu can reliably assemble complete microbial genomes and near-complete eukaryotic chromosomes using either PacBio or Oxford Nanopore technologies, and achieves a contig NG50 of greater than 21 Mbp on both human and Drosophila melanogaster PacBio datasets. For assembly structures that cannot be linearly represented, Canu provides graph-based assembly outputs in graphical fragment assembly (GFA) format for analysis or integration with complementary phasing and scaffolding techniques. The combination of such highly resolved assembly graphs with long-range scaffolding information promises the complete and automated assembly of complex genomes.
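The MinHash part of the overlapping strategy can be sketched in pure Python: compare bottom-k sketches of two reads' k-mer sets to estimate their Jaccard similarity. This omits Canu's tf-idf weighting, which down-weights repeat-derived k-mers, and is a conceptual illustration only:

```python
import hashlib

def kmers(seq, k=5):
    """The set of k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def sketch(kmer_set, size=64):
    """Bottom-k MinHash sketch: the `size` smallest k-mer hashes."""
    hashed = sorted(int(hashlib.sha1(km.encode()).hexdigest(), 16)
                    for km in kmer_set)
    return set(hashed[:size])

def minhash_jaccard(a, b, size=64):
    """Estimate Jaccard similarity of two reads' k-mer sets from sketches."""
    sa, sb = sketch(kmers(a), size), sketch(kmers(b), size)
    merged = sorted(sa | sb)[:size]
    return sum(h in sa and h in sb for h in merged) / len(merged)

print(minhash_jaccard("ACGTACGTACGT", "ACGTACGTACGT"))  # 1.0 for identical reads
```

Because sketches are small and fixed-size, candidate overlaps among millions of reads can be found without all-vs-all sequence comparison.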

37: Performance of neural network basecalling tools for Oxford Nanopore sequencing

Posted to bioRxiv 07 Feb 2019

7,773 downloads bioinformatics

Ryan Wick, Louise M. Judd, Kathryn E. Holt

Basecalling, the computational process of translating raw electrical signal to nucleotide sequence, is of critical importance to the sequencing platforms produced by Oxford Nanopore Technologies (ONT). Here we examine the performance of different basecalling tools, looking at accuracy at the level of bases within individual reads and at majority-rules consensus basecalls in an assembly. We also investigate some additional aspects of basecalling: training using a taxon-specific dataset, using a larger neural network model and improving consensus basecalls in an assembly by additional signal-level analysis with Nanopolish. Training basecallers on taxon-specific data results in a significant boost in consensus accuracy, mostly due to the reduction of errors in methylation motifs. A larger neural network is able to improve both read and consensus accuracy, but at a cost to speed. Improving consensus sequences ('polishing') with Nanopolish somewhat negates the accuracy differences in basecallers, but prepolish accuracy does have an effect on post-polish accuracy. Basecalling accuracy has seen significant improvements over the last two years. The current version of ONT's Guppy basecaller performs well overall, with good accuracy and fast performance. If higher accuracy is required, users should consider producing a custom model using a larger neural network and/or training data from the same species.

38: Deep Learning and Association Rule Mining for Predicting Drug Response in Cancer. A Personalised Medicine Approach.

Posted to bioRxiv 19 Aug 2016

7,555 downloads bioinformatics

Konstantinos Vougas, Magdalena Krochmal, Thomas Jackson, Alexander Polyzos, Archimides Aggelopoulos, Ioannis S Pateras, Michael Liontos, Anastasia Varvarigou, Elizabeth O Johnson, Vassilis Georgoulias, Antonia Vlahou, Paul Townsend, Dimitris Thanos, Jiri Bartek, Vassilis G Gorgoulis

A major challenge in cancer treatment is predicting the clinical response to anti-cancer drugs for each individual patient. For complex diseases such as cancer, characterized by high inter-patient variance, the implementation of precision medicine approaches is dependent upon understanding the pathological processes at the molecular level. While the 'omics' era provides unique opportunities to dissect the molecular features of diseases, the ability to utilize it in targeted therapeutic efforts is hindered by both the massive size and diverse nature of the 'omics' data. Recent advances with Deep Learning Neural Networks (DLNNs) suggest that DLNNs could be trained on large data sets to efficiently predict therapeutic responses in cancer treatment. We present the application of Association Rule Mining combined with DLNNs for the analysis of high-throughput molecular profiles of 1001 cancer cell lines, in order to extract cancer-specific signatures in the form of easily interpretable rules and use these rules as input to predict pharmacological responses to a large number of anti-cancer drugs. The proposed algorithm outperformed Random Forests (RF) and Bayesian Multitask Multiple Kernel Learning (BMMKL) classification, which currently represent the state-of-the-art in drug-response prediction. Moreover, the in silico pipeline presented introduces a novel strategy for identifying potential therapeutic targets, as well as possible drug combinations with high therapeutic potential. For the first time, we demonstrate that DLNNs trained on a large pharmacogenomics data-set can effectively predict the therapeutic response of specific drugs in different cancer types. These findings serve as a proof of concept for the application of DLNNs to predict therapeutic responsiveness, a milestone in precision medicine.

39: A deep learning approach to pattern recognition for short DNA sequences

Posted to bioRxiv 22 Jun 2018

7,469 downloads bioinformatics

Akosua Busia, George E. Dahl, Clara Fannjiang, David H. Alexander, Elizabeth Dorfman, Ryan Poplin, Cory Y McLean, Pi-Chuan Chang, Mark DePristo

Motivation: Inferring properties of biological sequences--such as determining the species-of-origin of a DNA sequence or the function of an amino-acid sequence--is a core task in many bioinformatics applications. These tasks are often solved using string-matching to map query sequences to labeled database sequences or via Hidden Markov Model-like pattern matching. In the current work we describe and assess a deep learning approach which trains a deep neural network (DNN) to predict database-derived labels directly from query sequences. Results: We demonstrate this DNN performs at or above state-of-the-art levels on a difficult, practically important problem: predicting species-of-origin from short reads of 16S ribosomal DNA. When trained on 16S sequences of over 13,000 distinct species, our DNN achieves read-level species classification accuracy within 2.0% of perfect memorization of training data, and produces more accurate genus-level assignments for reads from held-out species than k-mer, alignment, and taxonomic binning baselines. Moreover, our models exhibit greater robustness than these existing approaches to increasing noise in the query sequences. Finally, we show that these DNNs perform well on an experimental 16S mock community dataset. Overall, our results constitute a first step towards our long-term goal of developing a general-purpose deep learning approach to predicting meaningful labels from short biological sequences. Availability: TensorFlow training code is available through GitHub (<https://github.com/tensorflow/models/tree/master/research>). Data in TensorFlow TFRecord format is available on Google Cloud Storage (gs://brain-genomics-public/research/seq2species/). Contact: seq2species-interest{at}google.com Supplementary information: Supplementary data are available in a separate document.

40: VcfR: an R package to manipulate and visualize VCF format data

Posted to bioRxiv 26 Feb 2016

7,285 downloads bioinformatics

Brian J. Knaus, Niklaus J. Grünwald

Software to call single nucleotide polymorphisms or related genetic variants has converged on the variant call format (VCF) as the output format of choice. This has created a need for tools to work with VCF files. While an increasing number of software exists to read VCF data, many only extract the genotypes without including the data associated with each genotype that describes its quality. We created the R package vcfR to address this issue. We developed a VCF file exploration tool implemented in the R language because R provides an interactive experience and an environment that is commonly used for genetic data analysis. Functions to read and write VCF files into R as well as functions to extract portions of the data and to plot summary statistics of the data are implemented. VcfR further provides the ability to visualize how various parameterizations of the data affect the results. Additional tools are included to integrate sequence (FASTA) and annotation data (GFF) for visualization of genomic regions such as chromosomes. Conversion functions translate data from the vcfR data structure to formats used by other R genetics packages. Computationally intensive functions are implemented in C++ to improve performance. Use of these tools is intended to facilitate VCF data exploration, including intuitive methods for data quality control and easy export to other R packages for further analysis. VcfR thus provides essential, novel tools currently not available in R.
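What "data associated with each genotype" means is easiest to see on a single VCF body line, where the FORMAT column names the per-genotype fields (GT, GQ, DP, ...) and each sample column supplies the values. A plain-Python parse of one made-up record (vcfR itself is an R package; this only illustrates the file structure):

```python
# CHROM  POS  ID  REF  ALT  QUAL  FILTER  INFO  FORMAT  sample1
line = "chr1\t1423\t.\tA\tG\t29\tPASS\tDP=14\tGT:GQ:DP\t0/1:48:8"

fields = line.split("\t")
fmt_keys = fields[8].split(":")                    # ['GT', 'GQ', 'DP']
sample = dict(zip(fmt_keys, fields[9].split(":")))
print(sample)  # {'GT': '0/1', 'GQ': '48', 'DP': '8'}
```

Tools that keep only the GT field discard the genotype quality (GQ) and depth (DP) shown here, which is precisely the quality information vcfR is built to retain and visualize.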


