
Rxivist combines preprints from bioRxiv with data from Twitter to help you find the papers being discussed in your field. Currently indexing 63,016 bioRxiv papers from 279,490 authors.

Most downloaded bioRxiv papers, all time

in category bioinformatics

6,156 results found.

6001: A MALDI-MS biotyping-like method to address honey bee health status through computational modelling
Posted to bioRxiv 02 Jul 2019
84 downloads bioinformatics

Karim Arafah, Sébastien Nicolas Voisin, Victor Masson, Cédric Alaux, Yves Le Conte, Michel Bocquet, Philippe Bulet

Among pollinator insects, bees are undoubtedly the most important group. They play a critical role in boosting reproduction of wild and commercial plants and therefore contribute to the maintenance of plant biodiversity and the sustainability of food webs. In the last few decades, domesticated and wild bees have been subjected to biotic and abiotic threats, alone or in combination, causing various health disorders. Monitoring solutions to improve bee health are therefore increasingly necessary. MALDI mass spectrometry has emerged within this decade as a powerful technology for biotyping micro-organisms. This method is routinely used in clinical diagnosis, where molecular mass fingerprints corresponding to major protein signatures are matched against databases for real-time identification. Based on this strategy, we developed MALDI BeeTyping as a proof of concept to monitor significant hemolymph molecular changes in honey bees upon infection with a series of entomopathogenic Gram-positive and Gram-negative bacteria. A Serratia marcescens strain isolated from one “naturally” infected honey bee collected from the field was also considered. We performed a series of individually recorded hemolymph molecular mass fingerprints and built, to our knowledge, the first computational model made of nine molecular signatures, with a predictive score of 97.92%. We then challenged our model by classifying a training set of individual bees’ hemolymph and obtained overall recognition of 91.93%. Through this work, we aim to introduce a novel, realistic and time-saving high-throughput biotyping-like strategy that addresses honey bee health in infectious conditions, on an individual scale, through direct “blood tests”.

Significance Statement: Domesticated and wild bees worldwide represent the most active and valuable pollinators that ensure plant biodiversity and the success of many crops. These pollinators and others are exposed to deleterious pathogens and environmental stressors. Despite efforts to better understand how these threats affect honey bee health, solutions are still crucially needed to help beekeepers, scientists and stakeholders obtain a prognosis, an early diagnosis or a diagnosis of the health status of apiaries. In this study, we describe a new method to investigate honey bee health by a simple “blood test” using fingerprints of peptides/proteins as health-status signatures. Through computational modelling, we automated the identification of infected bees with a predictive score of 97.92%.
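The biotyping-like matching idea can be illustrated with a toy nearest-fingerprint classifier: a sketch of the general MALDI biotyping concept, not the authors' nine-signature model. All spectra, bins and labels below are invented.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two intensity vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def biotype(query, references):
    """Assign a query mass fingerprint to the reference class whose
    binned spectrum has the highest cosine similarity."""
    return max(references, key=lambda k: cosine(query, references[k]))

# Toy binned spectra (mean intensity per m/z bin) for two health states.
refs = {
    "uninfected": np.array([9.0, 1.0, 0.5, 8.0]),
    "S. marcescens": np.array([1.0, 7.0, 6.0, 0.5]),
}
label = biotype(np.array([8.5, 1.2, 0.4, 7.8]), refs)
print(label)  # → uninfected
```

In practice a spectrum has thousands of m/z bins and the matching is done against curated databases, but the nearest-profile principle is the same.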

6002: PAREameters: computational inference of plant microRNA-mRNA targeting rules using RNA sequencing data
Posted to bioRxiv 22 Jul 2019
83 downloads bioinformatics

Joshua Thody, Vincent Moulton, Irina Mohorianu

MicroRNAs (miRNAs) are short, non-coding RNAs that influence the translation rate of mRNAs by directing the RNA-induced silencing complex to sequence-specific targets. In plants, this typically results in cleavage and subsequent degradation of the mRNA. This can be captured on a high-throughput scale using degradome sequencing, which supports miRNA target prediction by aligning degradation fragments to reference mRNAs, enabling the identification of causal miRNA(s). The current criteria used for target prediction were inferred from experimentally validated A. thaliana interactions and were tuned to that specific subset of miRNA interactions. In addition, the miRNA pathway in other organisms may have acquired specific changes, e.g. lineage-specific miRNAs or new miRNA-mRNA interactions, so the previous criteria may not be optimal. We present a new tool, PAREameters, for inferring targeting criteria from RNA sequencing datasets; the stability of inferred criteria under subsampling and the effect of input size are discussed. We first evaluate its performance using experimentally validated miRNA-mRNA interactions in multiple A. thaliana datasets, including conserved and species-specific miRNAs. We then perform comprehensive analyses of the differences in flower miRNA-mRNA interactions in several non-model organisms and quantify the observed variations. PAREameters highlights an increase in sensitivity on most tested datasets when data-inferred criteria are used.

6003: Rapid Reconstruction of Time-varying Gene Regulatory Networks with Limited Main Memory
Posted to bioRxiv 05 Sep 2019
83 downloads bioinformatics

Saptarshi Pyne, Ashish Anand

Reconstruction of the time-varying gene regulatory networks underlying time-series gene expression data is a fundamental challenge in computational systems biology. The challenge increases multi-fold if the target networks need to be constructed for hundreds to thousands of genes. There have been constant efforts to design an algorithm that performs the reconstruction task correctly and scales efficiently (with respect to both time and memory) to such a large number of genes. However, the existing algorithms either do not offer time-efficiency, or they offer it at other costs: memory-inefficiency or the imposition of a constraint known as the 'smoothly time-varying assumption'. In this paper, two novel algorithms -- 'an algorithm for reconstructing Time-varying Gene regulatory networks with Shortlisted candidate regulators - which is Light on memory' (TGS-Lite) and 'TGS-Lite Plus' (TGS-Lite+) -- are proposed that are time-efficient, memory-efficient and do not impose the smoothly time-varying assumption. Additionally, they offer state-of-the-art reconstruction correctness, as demonstrated on three benchmark datasets. Source code: https://github.com/sap01/TGS-Lite-supplem/tree/master/sourcecode

6004: Thirty five novel nsSNPs may affect ADAMTS13 protein leading to Thrombotic thrombocytopenic purpura (TTP) using bioinformatics approach
Posted to bioRxiv 05 Sep 2019
82 downloads bioinformatics

Tebyan Ameer Abdelhameed, Arwa Ibrahim Ahmed, Mujahed Ibrahim Mustafa, Amel Nasir Eltayeb, Fatima A Abdelrhman, Amal Basheer Ahmed, Najla Basheer Ahmed, Hiba Yassin Khadir, Huda Khalid Mohamed, Soada Ahmed Osman, Mohamed Ahmed Hassan

Background: Genetic polymorphisms in the ADAMTS13 gene are associated with thrombotic thrombocytopenic purpura (TTP), a life-threatening microangiopathic disorder. This study aims to predict the possibly pathogenic SNPs of this gene and their impact on protein structure and function using in silico methods. Methods: SNPs retrieved from the NCBI database were analyzed using several bioinformatics tools. The different algorithms were applied collectively to predict the effect of single-nucleotide substitutions on both the structure and function of the ADAMTS13 protein. Results: Fifty-one mutations were found to be highly damaging to the structure and function of the protein. Of those, thirty-five were novel nsSNPs not previously reported in the literature. Conclusion: Our computational analysis identified thirty-five nsSNPs predicted to affect the ADAMTS13 protein and thereby lead to thrombotic thrombocytopenic purpura. Bioinformatics tools are vital in prediction analysis, making use of increasingly voluminous biomedical data and thereby providing markers for screening or for genetic mapping studies. Keywords: Thrombotic thrombocytopenic purpura (TTP), A Disintegrin and Metalloproteinase with Thrombospondin Motifs 13 (ADAMTS13), microangiopathic disorder, bioinformatics, single nucleotide polymorphisms (SNPs), computational, in silico.

6005: Bayesian Linear Mixed Models for Motif Activity Analysis
Posted to bioRxiv 07 Oct 2019
82 downloads bioinformatics

Simone Lederer, Tom Heskes, Simon J. van Heeringen, Cornelis Albers

Motivation: Cellular identity and behavior are controlled by complex gene regulatory networks. Transcription factors (TFs) bind to specific DNA sequences to regulate the transcription of their target genes. On the basis of these TF motifs in cis-regulatory elements we can model the influence of TFs on gene expression. Such models of TF motif activity usually assume a linear relationship between motif activity and gene expression level. A commonly used method to model motif influence is based on Ridge Regression. One important assumption of linear regression is independence between samples. However, if samples are generated from the same cell line, tissue, or other biological source, this assumption may be invalid. The same assumption of independence is also applied to different, yet similar, experimental conditions, which may likewise be inappropriate. In theory, the independence assumption between samples could lead to a loss in signal detection. Here we investigate whether a Bayesian model that allows for correlations results in more accurate inference of motif activities. Results: We extend Ridge Regression to a Bayesian Linear Mixed Model, which allows us to model dependence between samples. In a simulation study, we investigate the differences between the two model assumptions. We show that our Bayesian Linear Mixed Model implementation outperforms Ridge Regression in a simulation scenario where the noise, i.e. the signal that cannot be explained by TF motifs, is uncorrelated. However, we demonstrate that there is no such gain in performance if the noise has a covariance structure over samples similar to that of the signal that can be explained by motifs. We give a mathematical explanation of why this is the case. Using two representative real data sets we show that at most ~40% of the signal is explained by motifs using the linear model. With these data there is no advantage to using the Bayesian Linear Mixed Model, due to the similarity of the covariance structures. Availability & Implementation: The project implementation is available at https://github.com/Sim19/SimGEXPwMotifs.
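The Ridge Regression baseline that the paper extends has a closed form: with a gene-by-motif count matrix X and a gene-by-sample expression matrix E, the motif activities A solve (XᵀX + λI)A = XᵀE. A minimal simulated sketch; the penalty λ and all dimensions are arbitrary choices, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes, n_motifs, n_samples = 500, 10, 4

X = rng.poisson(1.0, size=(n_genes, n_motifs)).astype(float)  # motif counts per gene
A_true = rng.normal(size=(n_motifs, n_samples))               # true motif activities
E = X @ A_true + rng.normal(scale=0.1, size=(n_genes, n_samples))  # expression

lam = 1.0  # ridge penalty (arbitrary here)
A_hat = np.linalg.solve(X.T @ X + lam * np.eye(n_motifs), X.T @ E)
print(float(np.abs(A_hat - A_true).max()))  # small: estimates track the truth
```

The mixed-model extension replaces the implicit independent-noise assumption with a sample covariance term, which no closed form this simple captures.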

6006: Extraction of common task features in EEG-fMRI data using coupled tensor-tensor decomposition
Posted to bioRxiv 02 Jul 2019
82 downloads bioinformatics

Yaqub Jonmohamadi, Suresh Muthukumaraswamy, Joseph Chen, Jonathan Roberts, Ross Crawford, Ajay Pandey

The fusion of simultaneously recorded EEG and fMRI data is of great value to neuroscience research due to the complementary properties of the individual modalities. Traditionally, techniques such as PCA and ICA, which rely on strong non-physiological assumptions such as orthogonality and statistical independence, have been used for this purpose. Recently, tensor decomposition techniques such as parallel factor analysis have gained popularity in neuroimaging applications, as they inherently preserve the multidimensionality of neuroimaging data and achieve uniqueness in decomposition without imposing strong assumptions. Previously, the coupled matrix-tensor decomposition (CMTD) has been applied to the fusion of EEG and fMRI; the coupled tensor-tensor decomposition (CTTD) has only recently been proposed. Here, for the first time, we propose the use of CTTD of a 4th-order EEG tensor (space, time, frequency, and participant) and a 3rd-order fMRI tensor (space, time, participant), coupled partially in the time and participant domains, for the extraction of task-related features in both modalities. We used both sensor-level and source-level EEG for the coupling. Phase-shifted paradigm signals were incorporated as the temporal initializers of the CTTD to extract the task-related features. The validation of the approach is demonstrated on simultaneous EEG-fMRI recordings from six participants performing an N-Back memory task. The EEG and fMRI tensors were coupled in 9 components, of which 7 had a high correlation (more than 0.85) with the task. The result of the fusion recapitulates the well-known attention network as positively, and the default mode network as negatively, time-locked to the memory task.

6007: Small molecule docking of DNA repair proteins associated with cancer survival following PCNA metagene adjustment: A potential novel class of repair inhibitors
Posted to bioRxiv 03 Dec 2018
82 downloads bioinformatics

Leif E Peterson

Natural and synthetic small molecules from the NCI Developmental Therapeutics Program (DTP) were employed in molecular dynamics-based docking with DNA repair proteins whose RNA-Seq-based expression was associated with overall cancer survival (OS) after adjustment for the PCNA metagene. The compounds employed were required to elicit a sensitive response (vs. resistance) in more than half of the cell lines tested for each cancer. Methodological approaches included peptide sequence alignments and homology modeling for 3D protein structure determination, ligand preparation, docking, and toxicity and ADME prediction. Docking was performed for unique lists of DNA repair proteins which predict OS for AML, cancers of the breast, lung, colon, and ovaries, GBM, melanoma, and renal papillary cancer. Results indicate hundreds of drug-like and lead-like ligands with best-pose binding energies less than -6 kcal/mol. Ligand solubility for the top 20 drug-like hits approached lower bounds, while lipophilicity was acceptable. Most ligands were also blood-brain barrier permeable with high intestinal absorption rates. While the majority of ligands lacked positive prediction for hERG channel blockage and Ames carcinogenicity, there was considerable variation in predicted fathead minnow, honey bee, and Tetrahymena pyriformis toxicity. The computational results suggest the potential for new targets and mechanisms of repair inhibition and can be directly employed for in vitro and in vivo confirmatory laboratory experiments to identify new targets of therapy for cancer survival.

6008: Gene prediction in heterogeneous cancer tissues and establishment of Least Absolute Shrinking and Selection Operator model of lung squamous cell carcinoma.
Posted to bioRxiv 16 Aug 2019
81 downloads bioinformatics

Ateeq Muhammed Khaliq, SharathChandra RG, Meenakshi Rajamohan

Background: This study aims to establish a Least Absolute Shrinking and Selection Operator (LASSO) model based on tumor heterogeneity to predict the best features of LUSC in various cancer subtypes. Methods: The RNA-Seq data of 505 LUSC cancer samples were downloaded from the TCGA database. Subsequent to the identification of differentially expressed genes (DEGs), the samples were divided into two subtypes based on the consensus clustering method. The subtypes were characterized by the abundance of tissue-infiltrating immune and non-immune stromal cell populations. A LASSO model was established to predict each subtype's best genes. Enrichment pathway analysis was then carried out. Finally, the validity of the LUSC model for identifying features was established by survival analysis. Results: 240 and 262 samples were clustered into the Subtype-1 and Subtype-2 groups, respectively. DEG analysis was performed on each subtype. Applying a standard cutoff, 4586 genes were upregulated and 1495 downregulated in subtype-1, and 5016 genes were upregulated and 3224 downregulated in subtype-2. A LASSO model was established to predict the best features from each subtype; the 49 and 34 most relevant genes were selected in subtype-1 and subtype-2, respectively. The tissue-infiltrate abundance analysis distinguished the subtypes based on the expression pattern of immune infiltrates. Survival analysis showed that this model could effectively predict the best and distinct features in cancer subtypes. Discussion: This study suggests that unsupervised clustering and LASSO model-based feature selection can be effectively used to predict relevant genes which might play an important role in cancer diagnosis.
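As a generic illustration of the LASSO step (not the authors' pipeline), coordinate descent with soft-thresholding drives uninformative coefficients exactly to zero, which is what makes the model select "best" genes. The data below are simulated.

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * max(abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate-descent LASSO: minimize (1/2n)||y - Xb||^2 + lam*||b||_1.
    Assumes roughly standardized columns of X."""
    n, p = X.shape
    b = np.zeros(p)
    r = y - X @ b
    col_sq = (X ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(p):
            r += X[:, j] * b[j]                   # remove j's contribution
            rho = X[:, j] @ r / n                 # partial correlation
            b[j] = soft_threshold(rho, lam) / col_sq[j]
            r -= X[:, j] * b[j]                   # add updated contribution
    return b

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 20))        # 200 samples, 20 candidate genes
beta = np.zeros(20)
beta[[0, 3]] = [3.0, -2.0]            # only two informative genes
y = X @ beta + rng.normal(scale=0.5, size=200)
b = lasso_cd(X, y, lam=0.2)
print(np.flatnonzero(np.abs(b) > 0.5))  # → [0 3]
```

The surviving nonzero coefficients are the "selected" features; in the study this selection is run separately within each subtype.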

6009: Epicardial therapy with atrial appendage micrografts salvages myocardium after infarction
Posted to bioRxiv 29 Jul 2019
81 downloads bioinformatics

Xie Yanbo, Milla A Lampinen, Juuso M Takala, Vilbert T J Sikorski, Rabah Soliymani, Miika Tarkia, Maciej M Lalowski, Eero M A Mervaala, Markku Kupari, Zhe Zheng, Shengshou Hu, Ari Harjula, Esko Kankuri, AADC Consortium

Ischemic heart disease remains the leading cause of mortality and morbidity worldwide despite improved possibilities in medical care. Alongside interventional therapies, such as coronary artery bypass grafting, adjuvant tissue-engineered and cell-based treatments can provide regenerative improvement. Unfortunately, most of these advanced approaches require multiple lengthy and costly preparation stages without delivering significant clinical benefits. We evaluated the effect of epicardially delivered minute pieces of atrial appendage tissue material, defined as atrial appendage micrografts (AAMs), in a mouse myocardial infarction model. An extracellular matrix patch was used to cover and fix the AAMs onto the surface of the infarcted heart. The matrix-covered AAMs salvaged the heart from infarction-induced loss of functional myocardium and attenuated scarring. Site-selective proteomics of injured ischemic and uninjured distal myocardium from AAM-treated and untreated tissue sections revealed increased expression of several cardiac regeneration-associated proteins (i.e. periostin, transglutaminases and glutathione peroxidases) as well as activation of pathways responsible for angio- and cardiogenesis in relation to AAM therapy. Epicardial delivery of AAMs encased in an extracellular matrix patch scaffold salvages functional cardiac tissue from ischemic injury and restricts fibrosis after myocardial infarction. Our results support the use of AAMs as tissue-based therapy adjuvants for salvaging the ischemic myocardium.

6010: Rapidly Computing the Phylogenetic Transfer Index
Posted to bioRxiv 22 Aug 2019
81 downloads bioinformatics

Jakub Michal Truszkowski, Olivier Gascuel, Krister Swenson

Given trees T and T_o on the same taxon set, the transfer index φ(b, T_o) is the number of taxa that need to be ignored so that the bipartition induced by branch b in T is equal to some bipartition in T_o. Recently, Lemoine et al. [13] used the transfer index to design a novel bootstrap analysis technique that improves on the Felsenstein bootstrap on large, noisy data sets. In this work, we propose an algorithm that computes the transfer index for all branches b ∈ T in O(n log^3 n) time, which improves upon the current O(n^2)-time algorithm by Lin, Rajan and Moret [14]. Our implementation is able to process pairs of trees with hundreds of thousands of taxa in minutes and considerably speeds up the method of Lemoine et al. on large data sets. We believe our algorithm can be useful for comparing large phylogenies, especially when some taxa are misplaced (e.g. due to horizontal gene transfer, recombination, or reconstruction errors).
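For intuition, the transfer index has a straightforward quadratic-time formulation that the paper improves upon: represent each bipartition by one of its sides and take, over the other tree's bipartitions, the minimum symmetric-difference size, allowing for complementation. A toy sketch:

```python
def transfer_index(bip, other_bips, taxa):
    """Naive transfer index: minimum number of taxa to ignore so that `bip`
    (one side of a bipartition of T) matches some bipartition of the other tree."""
    best = len(taxa)
    for b in other_bips:
        sym = len(bip ^ b)                       # symmetric difference of sides
        best = min(best, sym, len(taxa) - sym)   # allow for complementation
    return best

taxa = frozenset("ABCDEF")
b = frozenset("ABC")                             # bipartition ABC | DEF in T
other = [frozenset("AB"), frozenset("ABCD")]     # bipartitions of T_o
ti = transfer_index(b, other, taxa)
print(ti)  # → 1 (ignore one taxon, e.g. C, to match AB | CDEF)
```

Doing this for all O(n) branches against all O(n) branches of the other tree is the O(n^2) baseline; the paper's contribution is computing all indices in O(n log^3 n).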

6011: Omic-Sig: Utilizing Omics Data to Explore and Visualize Kinase-Substrate Interactions
Posted to bioRxiv 05 Sep 2019
81 downloads bioinformatics

Tung-Shing Mamie Lih, David J. Clark, Hui Zhang

Protein phosphorylation is one of the most prevalent post-translational modifications, resulting from the activity of protein kinases phosphorylating specific substrates. Multiple cellular processes are regulated via protein phosphorylation, with aberrant signaling, driven by dysregulation of phosphorylation events, associated with disease progression (e.g., cancer). Mass spectrometry-based phosphoproteomics approaches can be leveraged for studying alterations of kinase-substrate activity in clinical cohorts. However, the information gained via interrogation of global proteomes and transcriptomes can offer additional insight into the interaction of kinases and their respective substrates. Therefore, we have developed the bioinformatics and data-visualization software tool Omic-Sig, which can stratify prominent phospho-substrates and their associated kinases based on differential abundances between case and control samples (e.g., tumors and their normal adjacent tissues from a cancer cohort) in a multi-omics fashion. Omic-Sig is available at https://github.com/hzhangjhu/Omic-Sig.

6012: Sensitivity and robustness of comorbidity network analysis
Posted to bioRxiv 05 Sep 2019
81 downloads bioinformatics

Jason Corey Brunson, Thomas P. Agresta, Reinhard C. Laubenbacher

Background: Comorbidity network analysis (CNA) is an increasingly popular approach in systems medicine, in which mathematical graphs encode epidemiological correlations (links) between diseases (nodes) inferred from their occurrence in an underlying patient population. A variety of methods have been used to infer properties of the constituent diseases or underlying populations from the network structure, but few have been validated or reproduced. Objectives: To test the robustness and sensitivity of several common CNA techniques to the source of population health data and the method of link determination. Methods: We obtained six sources of aggregated disease co-occurrence data, coded using varied ontologies, most of which were provided by the authors of CNAs. We constructed families of comorbidity networks from these data sets, in which links were determined using a range of statistical thresholds and measures of association. We calculated degree distributions, single-value statistics, and centrality rankings for these networks and evaluated their sensitivity to the source of data and link determination parameters. From two open-access sources of patient-level data, we constructed comorbidity networks using several multivariate models in addition to comparable pairwise models and evaluated differences between correlation estimates and network structure. Results: Global network statistics vary widely depending on the underlying population. Much of this variation is due to network density, which for our six data sets ranged over three orders of magnitude. The statistical threshold for link determination also had strong effects on global statistics, though at any fixed threshold the same patterns distinguished our six populations. The association measure used to quantify comorbid relations had smaller but discernible effects on global structure. 
Co-occurrence rates estimated using multivariate models were increasingly negative-shifted as the models accounted for more effects. However, only associations between the most prevalent disorders were consistent from model to model. Centrality rankings based on the same data set were likewise similar across different constructions; between data sets, however, they were difficult to compare and, where comparable, very different, especially across data sets using different ontologies. The most central disease codes were particular to the underlying populations and were often broad categories, injuries, or non-specific symptoms. Conclusions: CNAs can improve robustness and comparability by accounting for known limitations. In particular, we urge comorbidity network analysts (a) to include, where permissible, disaggregated disease occurrence data to allow more targeted reproduction and comparison of results; (b) to report differences in results obtained using different association measures, including both one of relative risk and one of correlation; (c) when identifying centrally located disorders, to carefully decide the most suitable ontology for this purpose; and (d) when relevant to the interpretation of results, to compare them to those obtained using a multivariate model.
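Recommendation (b), reporting both a relative-risk measure and a correlation measure, can be sketched from aggregated 2×2 co-occurrence counts; the counts and link thresholds below are hypothetical:

```python
import math

def rr_and_phi(n, n_i, n_j, n_ij):
    """Relative risk and phi correlation for co-occurrence of diseases i and j
    in a population of n patients (n_i, n_j marginal counts, n_ij joint count)."""
    rr = (n_ij * n) / (n_i * n_j)
    a, b, c = n_ij, n_i - n_ij, n_j - n_ij     # cells of the 2x2 table
    d = n - n_i - n_j + n_ij
    phi = (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return rr, phi

# Hypothetical counts: 10,000 patients, two disorders co-occurring often.
rr, phi = rr_and_phi(10_000, 500, 400, 100)
link = rr > 1 and phi > 0.1   # example thresholds for declaring a comorbidity link
print(round(rr, 1), round(phi, 3), link)  # → 5.0 0.187 True
```

A comorbidity network is then the graph whose edges are the disease pairs passing the chosen threshold, which is exactly why the threshold and measure choices drive the global statistics discussed above.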

6013: StructureDistiller: Structural relevance scoring increases resilience of contact maps to false positive predictions
Posted to bioRxiv 11 Jul 2019
81 downloads bioinformatics

Sebastian Bittrich, Michael Schroeder, Dirk Labudde

Protein folding and structure prediction are two sides of the same coin. We propose contact maps and the related techniques of constraint-based structure reconstruction as a unifying aspect of both processes. The presented Structural Relevance (SR) score quantifies the contribution of individual contacts and residues to structural integrity. We demonstrate that the entries of a contact map are not equally relevant for structural integrity. Structure prediction methods should explicitly consider the most relevant contacts for optimal performance, because doing so effectively doubles resilience toward false-positive contact predictions. Furthermore, knowledge of the most relevant contacts significantly increases reconstruction fidelity on sparse contact maps, by 0.4 Å. Protein folding is commonly characterized with spatial and temporal resolution: some residues are Early Folding while others are Highly Stable with respect to unfolding events. Using the proposed SR score, we demonstrate that folding initiation and structure stabilization are distinct processes.
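For context, a contact map is just a thresholded pairwise-distance matrix over residue coordinates. A sketch using an 8 Å C-alpha cutoff (a common convention, assumed here rather than taken from the paper), with made-up coordinates:

```python
import numpy as np

def contact_map(coords, cutoff=8.0):
    """Binary residue-residue contact map from C-alpha coordinates:
    two residues are in contact if their distance is below `cutoff` (Å)."""
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    return (d < cutoff) & ~np.eye(len(coords), dtype=bool)

# Four hypothetical C-alpha positions (Å): a short chain folding back on itself.
ca = np.array([[0.0, 0.0, 0.0], [3.8, 0.0, 0.0], [7.6, 0.0, 0.0], [0.0, 4.0, 0.0]])
cmap = contact_map(ca)
n_contacts = int(cmap.sum()) // 2   # map is symmetric; count each pair once
print(n_contacts)  # → 5
```

The SR score then ranks these binary entries by how much each contributes to reconstructing the structure, rather than treating all of them as equally informative constraints.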

6014: Are methylation beta-values simplex distributed?
Posted to bioRxiv 05 Sep 2019
81 downloads bioinformatics

Lara Nonell, Juan Ramon Gonzalez

DNA methylation plays an important role in the development and progression of disease. Beta-values are the standard methylation measures. Different statistical methods have been proposed to assess differences in methylation between conditions; however, most of them do not completely account for the distribution of beta-values. The simplex distribution can accommodate beta-value data, and we hypothesized that it is flexible enough to model methylation data. To test this hypothesis, we conducted several analyses using four real data sets obtained from microarray and sequencing technologies. Standard data distributions were studied and modelled in comparison to the simplex. In addition, simulations were conducted in different scenarios encompassing several distribution assumptions, regression models and sample sizes. Finally, we compared DNA methylation between females and males in order to benchmark the assessed methodologies under different scenarios. According to the results obtained from the simulations and real data analyses, DNA methylation data are concordant with the simplex distribution in many situations. Simplex regression models work well in small sample-size data sets. However, when sample size increases, other models such as beta regression or even linear regression can be employed to assess group comparisons and obtain unbiased results. Based on these results, we can provide some practical recommendations when analyzing methylation data: 1) use data sets of at least 10 samples per studied condition for microarray data sets, or 30 in NGS data sets; 2) apply a simplex or beta regression model for microarray data; 3) apply a linear model in any other case.
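A sketch of the simplex density as defined by Barndorff-Nielsen and Jørgensen (stated here from memory, so treat the exact formula as an assumption to verify), with a numerical check that it integrates to one for one arbitrary parameter choice:

```python
import numpy as np

def simplex_pdf(y, mu, sigma2):
    """Simplex distribution density on (0, 1) with mean mu and dispersion sigma2."""
    d = (y - mu) ** 2 / (y * (1 - y) * mu ** 2 * (1 - mu) ** 2)
    return (2 * np.pi * sigma2 * (y * (1 - y)) ** 3) ** -0.5 * np.exp(-d / (2 * sigma2))

# Numerical sanity check: the density should integrate to ~1 over (0, 1).
y = np.linspace(1e-6, 1 - 1e-6, 400_000)
area = float(np.sum(simplex_pdf(y, mu=0.3, sigma2=2.0)) * (y[1] - y[0]))
print(round(area, 2))  # ≈ 1.0
```

Unlike the beta density, the simplex density can place sharp mass near both boundaries simultaneously for large dispersion, which is one reason it can fit beta-values that cluster near 0 and 1.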

6015: PINOT: An Intuitive Resource for Integrating Protein-Protein Interactions
Posted to bioRxiv 30 Sep 2019
81 downloads bioinformatics

James E Tomkins, Raffaele Ferrari, Nikoleta Vavouraki, John Hardy, Ruth C. Lovering, Patrick A Lewis, Liam J McGuffin, Claudia Manzoni

The past decade has seen the rise of omics data for the understanding of biological systems in health and disease. This wealth of data includes protein-protein interaction (PPI) data derived from both low- and high-throughput assays, curated into multiple databases that capture the extent of available information from the peer-reviewed literature. Although these curation efforts are extremely useful, reliably downloading and integrating PPI data from the variety of available repositories is challenging and time consuming. Here we present a novel user-friendly web resource called PINOT (Protein Interaction Network Online Tool; available at http://www.reading.ac.uk/bioinf/PINOT/PINOT_form.html) to optimise the collection and processing of PPI data from the IMEx consortium associated repositories (members and observers) and from WormBase, for constructing human and C. elegans PPI networks, respectively. Users submit a query containing a list of proteins of interest for which PINOT will mine PPIs. PPI data are downloaded, merged, quality-checked, and confidence-scored based on the number of distinct methods and publications in which each interaction has been reported. Examples of PINOT applications are provided to highlight the performance, ease of use and potential uses of this tool. PINOT allows users to survey the literature, extracting PPI data for a list of proteins of interest. Comparison with analogous tools showed that PINOT extracts similar numbers of PPIs while incorporating a set of innovative features. PINOT processes both small and large queries, downloads PPIs live through PSICQUIC, and applies quality-control filters to the downloaded PPI annotations (i.e. removing the need for manual inspection by the user). PINOT also provides the user with information on the detection methods and publication history for each downloaded interaction entry, and returns results in a table format that can easily be further customised or imported directly into network visualization software.
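The scoring idea, counting distinct detection methods and distinct publications per interaction, can be sketched as simple set counting (the field names and the equal weighting are illustrative assumptions, not PINOT's actual schema):

```python
def confidence_score(evidence):
    """Score an interaction by the number of distinct detection methods
    plus the number of distinct supporting publications."""
    methods = {e["method"] for e in evidence}
    publications = {e["pmid"] for e in evidence}
    return len(methods) + len(publications)

# Hypothetical evidence records for one protein-protein interaction.
evidence = [
    {"method": "two hybrid", "pmid": "111"},
    {"method": "affinity chromatography", "pmid": "111"},
    {"method": "affinity chromatography", "pmid": "222"},
]
score = confidence_score(evidence)
print(score)  # → 4 (2 distinct methods + 2 distinct publications)
```

Scoring on distinct methods and publications, rather than raw record counts, avoids inflating confidence when one study deposits the same interaction in several repositories.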

6016: Minimum Error Calibration and Normalization for Genomic Copy Number Analysis
Posted to bioRxiv 31 Jul 2019
81 downloads bioinformatics

Bo Gao, Michael Baudis

Copy number variations (CNV) are regional deviations from the normal autosomal bi-allelic DNA content. While germline CNVs are a major contributor to genomic syndromes and inherited diseases, the majority of cancers accumulate extensive "somatic" CNV (sCNV or CNA) during the process of oncogenetic transformation and progression. While specific sCNV have closely been associated with tumorigenesis, intriguingly many neoplasias exhibit recurrent sCNV patterns beyond the involvement of a few cancer driver genes. Currently, CNV profiles of tumor samples are generated using genomic micro-arrays or high-throughput DNA sequencing. Regardless of the underlying technology, genomic copy number data is derived from the relative assessment and integration of multiple signals, with the data generation process being prone to contamination from several sources. Estimated copy number values have no absolute and linear correlation to their corresponding DNA levels, and the extent of deviation differs between sample profiles which poses a great challenge for data integration and comparison in large scale genome analysis. In this study, we present a novel method named Minimum Error Calibration and Normalization of Copy Numbers Analysis (Mecan4CNA). For each sCNV profile, Mecan4CNA reduces the noise level, calculates values representing the normal DNA copies (baseline) and the change of one copy (level distance), and finally normalizes all values. Experiments of Mecan4CNA on simulated data showed an overall accuracy of 93% and 91% in determining the baseline and level distance, respectively. Comparison of baseline and level distance estimation with existing methods and karyotyping data on the NCI-60 tumor cell line produced coherent results. 
To estimate the method's impact on downstream analyses, we performed GISTIC analyses on original and Mecan4CNA-normalized data from The Cancer Genome Atlas (TCGA), where the normalized data showed prominent improvements in both sensitivity and specificity in detecting focal regions. In general, Mecan4CNA provides an advanced method for CNA data normalization, especially in research involving data of high volume and heterogeneous quality, but with its informative output and visualization it can also facilitate the analysis of individual CNA profiles. Mecan4CNA is freely available as a Python package and through GitHub.
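The normalization step described in the abstract can be pictured with a short sketch. This is an illustration of the general idea only, not the published Mecan4CNA algorithm: it assumes the baseline (value of the normal two-copy state) and level distance (value change per single copy) have already been estimated, and maps raw segment values onto an absolute copy-number scale.

```python
def normalize_profile(values, baseline, level_distance):
    """Map raw segment values onto an absolute copy-number scale:
    the baseline becomes 2.0 (normal diploid state) and each
    level_distance away from it corresponds to one copy gained or lost."""
    return [2.0 + (v - baseline) / level_distance for v in values]

# Toy profile: baseline at 0.0, one copy change shifts the value by 0.3.
raw = [0.0, 0.3, -0.3, 0.62]
print(normalize_profile(raw, baseline=0.0, level_distance=0.3))
```

Because baseline and level distance differ between sample profiles, rescaling each profile this way is what makes cross-sample integration meaningful.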

6017: MCSS-based Docking and Improved Scoring of Protein-Nucleotide Complexes: I. A step forward to the Fragment-Based Design of Oligonucleotides

Posted to bioRxiv 01 May 2019

MCSS-based Docking and Improved Scoring of Protein-Nucleotide Complexes: I. A step forward to the Fragment-Based Design of Oligonucleotides
80 downloads bioinformatics

Nicolas Chevrollier, Fabrice Leclerc

Computational fragment-based approaches have been widely used in drug design and drug discovery. One limitation to their application is the limited performance of the scoring functions. With the emergence of new fragment-based approaches for single-stranded RNA ligands, we propose an analysis of the docking power of an MCSS-based approach evaluated on nucleotide binding sites. Combined with a clustering method and some optimal scoring functions, the results suggest that it could be used in the design of oligonucleotides.

6018: SOAPTyping: an open-source and cross-platform tool for Sanger sequence-based typing for HLA class I and II alleles

Posted to bioRxiv 20 Jun 2019

SOAPTyping: an open-source and cross-platform tool for Sanger sequence-based typing for HLA class I and II alleles
80 downloads bioinformatics

Yong Zhang, Yongsheng Chen, Huixin Xu, Lin Fang, Zijian Zhao, Weipeng Hu, Xiaoqin Yang, Jia Ye, Yun Cheng, Jiayin Wang, Jian Wang, Huanming Yang, Jing Yan

The human leukocyte antigen (HLA) gene family plays a key role in the immune response and is thus crucial in many biomedical and clinical settings. Sanger sequencing, the gold-standard technology for HLA typing, enables accurate, high-resolution identification of HLA alleles. However, a current hurdle is that only commercial software such as UType, SBT-Assign and SBTEngine, rather than any open-source tool, can be used to perform HLA typing based on Sanger sequencing. To fill this gap, we developed a stand-alone, open-source and cross-platform software, named SOAPTyping, for Sanger-based typing of HLA class I and II alleles. Availability and implementation: SOAPTyping is implemented in C++ with the Qt framework and is supported on Windows, Mac and Linux. Source code and detailed documentation are accessible via the project GitHub page: https://github.com/BGI-flexlab/SOAPTyping.

6019: Maximizing the Reusability of Gene Expression Data by Predicting Missing Metadata

Posted to bioRxiv 03 Oct 2019

Maximizing the Reusability of Gene Expression Data by Predicting Missing Metadata
80 downloads bioinformatics

Pei-Yau Lung, Xiaodong Pang, Yan Li, Jinfeng Zhang

Reusability is part of the FAIR data principles, which aim to make data Findable, Accessible, Interoperable, and Reusable. One of the current efforts to increase the reusability of public genomics data has been to focus on the inclusion of quality metadata associated with the data. When necessary metadata are missing, most researchers will consider the data useless. In this study, we develop a framework to predict the missing metadata of gene expression datasets to maximize their reusability. We propose a new metric called Proportion of Cases Accurately Predicted (PCAP), which is optimized in our specifically designed machine learning pipeline. The new approach performed better at maximizing the reusability of data with missing values than pipelines using common metrics such as the F1-score. We also found that different variables may need to be predicted using different machine learning methods and/or different data processing protocols. Using differential gene expression analysis as an example, we show that when missing variables are accurately predicted, the corresponding gene expression data can be reliably used in downstream analyses.
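The core idea of predicting a missing metadata field from the expression data itself can be sketched in a few lines. This is a hedged illustration of the general approach, not the authors' pipeline or metric: it imputes a hypothetical missing label (e.g. tissue type) with a simple nearest-centroid classifier trained on samples whose metadata are complete.

```python
import math

def centroids(samples):
    """samples: list of (expression_vector, label) pairs with known metadata.
    Returns the mean expression vector (centroid) per label."""
    sums, counts = {}, {}
    for vec, label in samples:
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, v in enumerate(vec):
            acc[i] += v
        counts[label] = counts.get(label, 0) + 1
    return {lab: [s / counts[lab] for s in acc] for lab, acc in sums.items()}

def predict_label(vec, cents):
    """Impute the missing label as the one whose centroid is nearest."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(cents, key=lambda lab: dist(vec, cents[lab]))

# Hypothetical two-gene expression vectors with known tissue labels.
labelled = [([1.0, 0.1], "liver"), ([0.9, 0.2], "liver"),
            ([0.1, 1.0], "brain"), ([0.2, 0.9], "brain")]
cents = centroids(labelled)
print(predict_label([0.95, 0.15], cents))  # a liver-like profile
```

As the abstract notes, in practice different metadata variables may call for different models and preprocessing; this sketch only shows the imputation framing.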

6020: Statistical modeling of bacterial promoter sequences for regulatory motif discovery with the help of transcriptome data: application to Listeria monocytogenes

Posted to bioRxiv 02 Aug 2019

Statistical modeling of bacterial promoter sequences for regulatory motif discovery with the help of transcriptome data: application to Listeria monocytogenes
80 downloads bioinformatics

Pierre Nicolas, Ibrahim Sultan, Vincent Fromion, Sophie Schbath

Powerful algorithms have been developed for the discovery of bacterial transcription factor binding sites, but automatic de novo identification of the main regulons of a bacterium from genome and transcriptome data remains a challenge. To address this task, we propose a statistical model of the DNA sequence that can use information on the exact positions of transcription start sites and condition-dependent expression profiles. Two main novelties of this model are to allow overlaps between motif occurrences and to incorporate covariates summarizing expression profiles into the probabilities of motif occurrences in promoter regions. Covariates may correspond to the coordinates of the genes on axes such as those obtained by Principal Component Analysis or Independent Component Analysis, or to the positions of the genes in a tree such as one obtained by hierarchical clustering. All parameters are estimated in a Bayesian framework using a dedicated trans-dimensional Markov chain Monte Carlo algorithm that adjusts, simultaneously and for many motifs and many expression covariates: the width and palindromic properties of the corresponding position-weight matrices, the number of parameters describing position with respect to the transcription start site, and the choice of relevant covariates. A dataset of 1,512 transcription start sites and 165 expression profiles available for Listeria monocytogenes served to validate the approach, which proved able to predict regulons corresponding to many known transcription factors (SigB, CcpA, Rex, LiaR, Fur, LexA, VirR, Spx, BglR2, SigL, PrfA). The results thus provide a new global prediction of the transcriptional regulatory network of this model food-borne pathogen. For instance, a previously unreported motif was found upstream of the promoters controlling the expression of many ribosomal proteins, which suggests that an unknown transcription factor might play an important role in the regulation of growth.
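The second novelty above, expression covariates entering the probability of a motif occurrence, can be pictured with a minimal sketch. This is an assumed illustrative form, not the paper's exact parameterization: the prior probability that a motif occurs in a promoter is taken as a logistic function of that gene's expression covariates (e.g. its coordinates on principal-component axes), with hypothetical coefficients.

```python
import math

def motif_occurrence_prob(covariates, beta0, betas):
    """Logistic link: P(occurrence) = sigmoid(beta0 + sum_k beta_k * x_k),
    so promoters of genes scoring high on a relevant expression axis get
    a higher prior probability of carrying the motif."""
    z = beta0 + sum(b * x for b, x in zip(betas, covariates))
    return 1.0 / (1.0 + math.exp(-z))

# A promoter scoring high on a stress-related expression axis (made-up
# coefficients) gets a raised prior for, say, a SigB-like motif.
print(motif_occurrence_prob([2.0, -0.5], beta0=-1.0, betas=[1.2, 0.3]))
```

In the actual model these coefficients, together with the position-weight matrices and positional parameters, would be sampled jointly by the trans-dimensional MCMC algorithm.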
