Rxivist logo

Sequential compression of gene expression across dimensionalities and methods reveals no single best method or dimensionality

By Gregory P. Way, Michael Zietz, Vincent Rubinetti, Daniel S. Himmelstein, Casey S. Greene

Posted 11 Mar 2019
bioRxiv DOI: 10.1101/573782

Background Unsupervised compression algorithms applied to gene expression data extract latent, or hidden, signals representing technical and biological sources of variation. However, these algorithms require a user to select a biologically-appropriate latent dimensionality. In practice, most researchers select a single algorithm and latent dimensionality. We sought to determine the extent by which using multiple dimensionalities across ensemble compression models improves biological representations. Results We compressed gene expression data from three large datasets consisting of adult normal tissue, adult cancer tissue, and pediatric cancer tissue. We compressed these data into many latent dimensionalities ranging from 2 to 200. We observed various tradeoffs across latent dimensionalities and compression models. For example, we observed high model stability between principal components analysis (PCA), independent components analysis (ICA), and non-negative matrix factorization (NMF). We identified more unique biological signatures in ensembles of denoising autoencoder (DAE) and variational autoencoder (VAE) models in intermediate latent dimensionalities. However, we captured the most pathway-associated features using all compressed features across algorithms and dimensionalities. Optimized at different latent dimensionalities, compression models detect generalizable gene expression signatures representing sex, neuroblastoma MYCN amplification, and cell types. In two supervised machine learning tasks, compressed features optimized predictions at different latent dimensionalities. Conclusions There is no single best latent dimensionality or compression algorithm for analyzing gene expression data. Instead, using feature ensembles from different compression models across latent space dimensionalities optimizes biological representations.

Download data

  • Downloaded 1,370 times
  • Download rankings, all-time:
    • Site-wide: 7,988 out of 101,349
    • In bioinformatics: 1,352 out of 9,295
  • Year to date:
    • Site-wide: 14,045 out of 101,349
  • Since beginning of last month:
    • Site-wide: 22,504 out of 101,349

Altmetric data

Downloads over time

Distribution of downloads per paper, site-wide


Sign up for the Rxivist weekly newsletter! (Click here for more details.)


  • 20 Oct 2020: Support for sorting preprints using Twitter activity has been removed, at least temporarily, until a new source of social media activity data becomes available.
  • 18 Dec 2019: We're pleased to announce PanLingua, a new tool that enables you to search for machine-translated bioRxiv preprints using more than 100 different languages.
  • 21 May 2019: PLOS Biology has published a community page about Rxivist.org and its design.
  • 10 May 2019: The paper analyzing the Rxivist dataset has been published at eLife.
  • 1 Mar 2019: We now have summary statistics about bioRxiv downloads and submissions.
  • 8 Feb 2019: Data from Altmetric is now available on the Rxivist details page for every preprint. Look for the "donut" under the download metrics.
  • 30 Jan 2019: preLights has featured the Rxivist preprint and written about our findings.
  • 22 Jan 2019: Nature just published an article about Rxivist and our data.
  • 13 Jan 2019: The Rxivist preprint is live!