Missing Value Imputation Approach for Mass Spectrometry-based Metabolomics Data

By Runmin Wei, Jingye Wang, Mingming Su, Erik Jia, Tianlu Chen, Yan Ni

Posted 17 Aug 2017
bioRxiv DOI: 10.1101/171967 (published DOI: 10.1038/s41598-017-19120-0)

Introduction: Missing values are widespread in mass spectrometry (MS)-based metabolomics data. Various methods have been applied to handle them, but the choice of method can significantly affect subsequent data analysis and interpretation. Missing values are conventionally classified into three types: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR).

Objectives: The aim of this study was to comprehensively compare common imputation methods for the different types of missing values, using two separate metabolomics data sets (977 and 198 serum samples, respectively), and to propose a strategy for dealing with missing values in metabolomics studies.

Methods: The imputation methods compared were zero, half minimum (HM), mean, median, random forest (RF), singular value decomposition (SVD), k-nearest neighbors (kNN), and quantile regression imputation of left-censored data (QRILC). Normalized root mean squared error (NRMSE) and NRMSE-based sum of ranks (SOR) were applied to evaluate imputation accuracy for MCAR/MAR and MNAR, respectively. The principal component analysis (PCA)/partial least squares (PLS)-Procrustes sum of squared errors was used to evaluate the overall sample distribution. Student's t-test followed by Pearson correlation analysis was conducted to evaluate the effect of imputation on univariate statistical analysis.

Results: Our findings demonstrated that RF imputation performed best for MCAR/MAR and that QRILC was the favored method for MNAR.

Conclusion: Combining these findings with the "modified 80% rule", we proposed a comprehensive strategy and developed a publicly accessible web tool for missing value imputation in metabolomics data.
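To make the evaluation idea in the abstract concrete, the following is a minimal sketch, not the authors' implementation or their web tool. It hides known values from a single simulated metabolite under two missingness patterns, fills them with the four constant-value methods named above (zero, HM, mean, median), and scores each with NRMSE. All function names are hypothetical. The NRMSE normalization used here (variance of the true values at the hidden positions) is one common convention and is an assumption, since the abstract does not state which one was used. RF, SVD, kNN, and QRILC need multi-metabolite data and are not shown.

```python
import math
import random
import statistics


def mask_mcar(values, fraction, rng):
    """MCAR: hide a random subset of positions, independent of the values."""
    k = max(2, round(fraction * len(values)))
    return sorted(rng.sample(range(len(values)), k))


def mask_mnar(values, fraction):
    """MNAR (left-censored): hide the lowest values, as if below a detection limit."""
    k = max(2, round(fraction * len(values)))
    order = sorted(range(len(values)), key=lambda i: values[i])
    return sorted(order[:k])


def impute_column(values, method):
    """Replace every None in one metabolite column with a single constant."""
    observed = [v for v in values if v is not None]
    if method == "zero":
        fill = 0.0
    elif method == "half_min":
        fill = min(observed) / 2.0
    elif method == "mean":
        fill = statistics.fmean(observed)
    elif method == "median":
        fill = statistics.median(observed)
    else:
        raise ValueError(f"unknown method: {method}")
    return [fill if v is None else v for v in values]


def nrmse(truth, imputed, hidden_idx):
    """RMSE over the hidden positions, normalized by the variance of the true
    values at those positions (assumed convention; needs non-constant truth)."""
    errors = [(truth[i] - imputed[i]) ** 2 for i in hidden_idx]
    spread = statistics.pvariance([truth[i] for i in hidden_idx])
    return math.sqrt(statistics.fmean(errors) / spread)


# Simulated intensities for one metabolite (log-normal is an illustrative choice).
rng = random.Random(0)
truth = [rng.lognormvariate(2.0, 0.5) for _ in range(200)]

results = {}
patterns = (("MCAR", mask_mcar(truth, 0.2, rng)), ("MNAR", mask_mnar(truth, 0.2)))
for pattern, hidden_idx in patterns:
    hidden = set(hidden_idx)
    masked = [None if i in hidden else v for i, v in enumerate(truth)]
    for method in ("zero", "half_min", "mean", "median"):
        score = nrmse(truth, impute_column(masked, method), hidden_idx)
        results[(pattern, method)] = score
        print(f"{pattern:5s} {method:9s} NRMSE = {score:.3f}")
```

Lower NRMSE means the imputed values sit closer to the hidden truth. Because the truth is only known in a simulation like this one, the same masking step would be needed before scoring any method on real data.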

Download data

  • Downloaded 866 times
  • Download rankings, all-time:
    • Site-wide: 15,391 out of 94,912
    • In bioinformatics: 2,367 out of 8,837
  • Year to date:
    • Site-wide: 42,853 out of 94,912
  • Since beginning of last month:
    • Site-wide: 22,647 out of 94,912
