Assessment of mapping strategies for determining the 5ʹ-end of mRNAs and long-noncoding RNAs with short read sequences

By Shuhei Noguchi, Hideya Kawaji, Takeya Kasukawa

Posted 15 Mar 2020
bioRxiv DOI: 10.1101/2020.03.14.982991

Genome mapping is an essential step in data processing for transcriptome analysis, and many previous studies have evaluated various methods and strategies for mapping RNA-seq data. Cap Analysis of Gene Expression (CAGE) is a sequencing-based protocol specifically designed to capture the 5'-ends of transcripts for quantitatively measuring the expression levels of transcription start sites genome-wide. Because CAGE analysis can also predict the activities of promoters and enhancers, this protocol has been an essential tool in studies of transcriptional regulation. Typically, the same mapping software is used to align both RNA-seq data and CAGE reads to a reference genome, but which mapping software and options are most appropriate for mapping the 5'-end sequence reads obtained through CAGE has not previously been evaluated systematically.

Here we assessed various strategies for aligning CAGE reads, particularly ~50-bp sequences, to the human genome by using the HISAT2, LAST, and STAR programs both with and without a reference transcriptome. One of the major inconsistencies among the tested strategies involves alignments to pseudogenes and parent genes: some of the strategies prioritized alignments with pseudogenes even when the read could be aligned with coding genes with fewer mismatches. Another inconsistency concerned the detection of exon-exon junctions. These preferences depended on the program applied and whether a reference transcriptome was included. Overall, the choice of strategy yielded different mapping results for approximately 2% of all promoters.

Although the various alignment strategies produced very similar results overall, we noted several important and measurable differences. In particular, using the reference transcriptome in STAR yielded alignments with the fewest mismatches. In addition, the inconsistencies among the strategies were especially noticeable regarding alignments to pseudogenes and novel splice junctions. Our results indicate that the choice of alignment strategy is important because it might affect the biological interpretation of the data.

Download data

  • Downloaded 275 times
  • Download rankings, all-time:
    • Site-wide: 140,244
    • In bioinformatics: 10,791
  • Year to date:
    • Site-wide: 127,994
  • Since beginning of last month:
    • Site-wide: 118,238
