
Normalization benchmark of ATAC-seq datasets shows the importance of accounting for GC-content effects

By Koen Van den Berge, Hsin-Jung Chou, Hector Roux de Bézieux, Kelly N Street, Davide Risso, John Ngai, Sandrine Dudoit

Posted 27 Jan 2021
bioRxiv DOI: 10.1101/2021.01.26.428252

Modern assays have enabled high-throughput studies of epigenetic regulation of gene expression using DNA sequencing. In particular, the assay for transposase-accessible chromatin using sequencing (ATAC-seq) allows the study of chromatin configuration for an entire genome. Despite the gain in popularity of the assay, there have been limited studies investigating the analytical challenges related to ATAC-seq data, and most studies leverage tools developed for bulk transcriptome sequencing (RNA-seq). Here, we show that GC-content effects are omnipresent in ATAC-seq datasets. Since the GC-content effects are sample-specific, they can bias downstream analyses such as clustering and differential accessibility analysis. We evaluate twelve different normalization procedures on eight public ATAC-seq datasets and show that no method uniformly outperforms all others. However, our work clearly shows that accounting for GC-content effects in the normalization is crucial for common downstream ATAC-seq data analyses, such as clustering and differential accessibility analysis, leading to improved accuracy and interpretability of the results. Using two case studies, we show that exploratory data analysis is essential to guide the choice of an appropriate normalization method for a given dataset.
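
To make the idea of a sample-specific GC-content correction concrete, here is a minimal sketch in Python. It is not one of the twelve procedures evaluated in the paper; it is a simplified illustration that assumes a single sample's per-peak read counts and per-peak GC fractions are already available. The function name `gc_bin_normalize` and the parameter `n_bins` are hypothetical. Peaks are grouped into equal-sized bins by GC content, and each bin is rescaled so that its median count matches the sample-wide median.

```python
from statistics import median


def gc_bin_normalize(counts, gc, n_bins=4):
    """Rescale one sample's peak counts so each GC bin shares the same median.

    counts: read counts per peak for a single sample (hypothetical input).
    gc:     GC fraction per peak, in the same order as counts.
    n_bins: number of equal-sized GC bins (an illustrative choice).
    """
    if len(counts) != len(gc):
        raise ValueError("counts and gc must have the same length")

    # Rank peaks by GC content and split the ranking into equal-sized bins.
    order = sorted(range(len(gc)), key=lambda i: gc[i])
    bin_size = -(-len(order) // n_bins)  # ceiling division

    # The sample-wide median is the common target for every bin.
    target = median(counts)

    normalized = [0.0] * len(counts)
    for start in range(0, len(order), bin_size):
        members = order[start:start + bin_size]
        bin_median = median(counts[i] for i in members)
        # Leave a bin unchanged if its median is zero to avoid dividing by zero.
        scale = target / bin_median if bin_median > 0 else 1.0
        for i in members:
            normalized[i] = counts[i] * scale
    return normalized


# Toy sample in which high-GC peaks receive systematically more reads.
gc = [0.30, 0.35, 0.40, 0.45, 0.60, 0.65, 0.70, 0.75]
counts = [10, 12, 11, 13, 40, 44, 42, 46]
normalized = gc_bin_normalize(counts, gc, n_bins=2)
```

Because the correction is computed within each sample, running it separately on every sample removes a GC trend that differs from sample to sample, which is the kind of effect the abstract describes as biasing clustering and differential accessibility analysis. A real analysis would use an established normalization method chosen after exploratory data analysis.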

Download data

  • Downloaded 257 times
  • Download rankings, all-time:
    • Site-wide: 94,827
    • In bioinformatics: 8,282
  • Year to date:
    • Site-wide: 8,257
  • Since beginning of last month:
    • Site-wide: 37,380
