Background: Variants in transcription factor binding sites (TFBSs) may have important regulatory effects, as they have the potential to alter transcription factor (TF) binding affinities and thereby affect gene expression. With recent advances in sequencing technologies, the number of variants identified in TFBSs has increased, so understanding their role is of significant interest when interpreting next-generation sequencing data. Current methods have two major limitations: they are limited to predicting the functional impact of single nucleotide variants (SNVs), and they often rely on additional experimental data, which are laborious and expensive to acquire. We propose a purely bioinformatic method that addresses these two limitations while providing comparable results.

Results: Our method uses position weight matrices and a sliding window approach to account for the sequence context of variants, and it scores the consequences of both SNVs and INDELs in TFBSs. We tested the accuracy of our method in two ways. First, we compared it to a recent method, based on DNase I hypersensitive sites sequencing (DHS-seq) data, that was designed to predict the effects of SNVs: we found a significant correlation of our score with both their DHS-seq data and their prediction model. Second, we called INDELs on publicly available DHS-seq data from ENCODE and found that our score represents the experimental data well. We concluded that our method is reliable, and we used it to describe the landscape of variation in TFBSs in the human genome by scoring all variants in the 1000 Genomes Project Phase 3. Surprisingly, we found that most insertions have neutral effects on binding sites, while deletions, as expected, have the most severe TFBS-scores. We identified four categories of variants based on their TFBS-scores and tested them for enrichment of variants classified as pathogenic, benign and protective in ClinVar: the variants with the most negative TFBS-scores have the most significant enrichment for pathogenic variants.

Conclusions: Our method addresses key shortcomings of currently available bioinformatic tools in predicting the effects of INDELs in TFBSs, and provides an unprecedented window into the genome-wide landscape of INDELs, their predicted influences on TF binding, and their potential relevance for human diseases. We thus offer an additional tool to help prioritise non-coding variants in sequencing studies.
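The general idea of scoring a variant with a position weight matrix and a sliding window can be illustrated with a minimal sketch. This is not the authors' TFBS-score, whose exact definition is not given in the abstract: the toy matrix, the function names (`window_scores`, `best_score`, `score_change`), and the choice of "best window in the alternate sequence minus best window in the reference sequence" are all assumptions made for illustration. Because each sequence is rescanned in full, the same comparison works for SNVs and for INDELs that change the sequence length.

```python
from typing import Dict, List

# Hypothetical representation: one dict of per-base log-odds scores
# for each position of the motif.
PWM = List[Dict[str, float]]


def window_scores(seq: str, pwm: PWM) -> List[float]:
    """Score every window of motif width along seq."""
    width = len(pwm)
    return [
        sum(col[base] for col, base in zip(pwm, seq[i:i + width]))
        for i in range(len(seq) - width + 1)
    ]


def best_score(seq: str, pwm: PWM) -> float:
    """Return the highest window score found in seq."""
    scores = window_scores(seq, pwm)
    if not scores:
        raise ValueError("sequence is shorter than the motif")
    return max(scores)


def score_change(ref_seq: str, alt_seq: str, pwm: PWM) -> float:
    """Illustrative effect score: best alternate window minus best
    reference window. Negative values suggest a weakened site."""
    return best_score(alt_seq, pwm) - best_score(ref_seq, pwm)


# Toy 3-bp motif that favours "TGA" (assumed values, not real TF data).
toy_pwm: PWM = [
    {base: (1.0 if base == favoured else -1.0) for base in "ACGT"}
    for favoured in "TGA"
]

ref = "CCTGACC"
print(score_change(ref, "CCTCACC", toy_pwm))   # SNV inside the motif
print(score_change(ref, "CCTACC", toy_pwm))    # 1-bp deletion inside the motif
print(score_change(ref, "CACTGACC", toy_pwm))  # 1-bp insertion in the flank
```

In this toy case the deletion inside the motif lowers the best score more than the SNV does, and the flanking insertion leaves it unchanged, because the intact motif is simply found one position further along.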