Random forest-based imputation outperforms other methods for imputing LC-MS metabolomics data: A comparative study

Marietta Kokla (Corresponding Author), Jyrki Virtanen, Marjukka Kolehmainen, Jussi Paananen, Kati Hanhineva

Research output: Contribution to journal › Article › Scientific › peer-review

Abstract

Background: LC-MS technology makes it possible to measure the relative abundance of numerous molecular features of a sample in a single analysis. However, non-targeted metabolite profiling approaches in particular generate vast arrays of data that are prone to aberrations such as missing values. Whatever the reason for the missing values, a coherent and complete data matrix is always a prerequisite for accurate and reliable statistical analysis. Therefore, there is a need for proper imputation strategies that account for the missingness and reduce bias in the statistical analysis.

Results: Here we present the results of evaluating nine imputation methods at four different percentages of missing values of different origins. The performance of each imputation method was assessed by the Normalized Root Mean Squared Error (NRMSE). Random forest (RF) had the lowest NRMSE in the estimation of values that were Missing at Random (MAR) and Missing Completely at Random (MCAR). For values Missing Not at Random (MNAR), the left-truncated data were best imputed with minimum value imputation. We also tested the imputation methods on datasets containing missing data of various origins, and RF was the most accurate method in all cases. The results were obtained by repeating the evaluation process 100 times on metabolomics datasets into which missing values were introduced to represent absent data of different origins.

Conclusion: The type and rate of missingness affect the performance and suitability of imputation methods. RF-based imputation performs best in most of the tested scenarios, including combinations of different types and rates of missingness. We therefore recommend random forest-based imputation for missing metabolomics data, especially when the types of missingness are not known in advance.
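The evaluation loop described in the abstract (introduce missing values into complete data, impute, score with NRMSE, repeat) can be sketched roughly as follows. This is a minimal illustration and not the authors' code. It runs on simulated Gaussian values rather than real LC-MS data, compares only minimum and mean imputation (random forest imputation is not shown), and assumes the NRMSE definition sqrt(MSE over the missing entries / variance of their true values), which may differ from the one used in the paper. All function names, the matrix size, and the 20% rate are hypothetical choices.

```python
import math
import random
import statistics


def simulate(n_samples, n_features, rng):
    """Complete matrix of simulated log-scale intensities (not real LC-MS data)."""
    return [[rng.gauss(10.0, 1.0) for _ in range(n_features)] for _ in range(n_samples)]


def mask_mcar(data, rate, rng):
    """MCAR: every entry is dropped with the same probability."""
    return [[None if rng.random() < rate else v for v in row] for row in data]


def mask_mnar_left(data, rate):
    """MNAR by left truncation: per feature, drop the lowest `rate` fraction."""
    n = len(data)
    k = int(rate * n)
    masked = [row[:] for row in data]
    for j in range(len(data[0])):
        lowest_first = sorted(range(n), key=lambda i: data[i][j])
        for i in lowest_first[:k]:
            masked[i][j] = None
    return masked


def impute_columnwise(masked, stat):
    """Fill missing entries with stat(observed values) of the same column.

    Assumes every column keeps at least one observed value.
    """
    filled = [row[:] for row in masked]
    for j in range(len(masked[0])):
        fill = stat([row[j] for row in masked if row[j] is not None])
        for row in filled:
            if row[j] is None:
                row[j] = fill
    return filled


def nrmse(truth, imputed, masked):
    """Assumed definition: sqrt(MSE on missing entries / variance of their true values)."""
    pairs = [
        (t, i)
        for t_row, i_row, m_row in zip(truth, imputed, masked)
        for t, i, m in zip(t_row, i_row, m_row)
        if m is None
    ]
    mse = statistics.fmean([(t - i) ** 2 for t, i in pairs])
    return math.sqrt(mse / statistics.pvariance([t for t, _ in pairs]))


def evaluate(kind, rate, n_repeats=10, seed=0):
    """Average NRMSE of minimum and mean imputation over repeated simulations."""
    rng = random.Random(seed)
    scores = {"min": [], "mean": []}
    for _ in range(n_repeats):
        truth = simulate(50, 20, rng)
        if kind == "MCAR":
            masked = mask_mcar(truth, rate, rng)
        else:
            masked = mask_mnar_left(truth, rate)
        scores["min"].append(nrmse(truth, impute_columnwise(masked, min), masked))
        scores["mean"].append(nrmse(truth, impute_columnwise(masked, statistics.fmean), masked))
    return {name: statistics.fmean(vals) for name, vals in scores.items()}


for kind in ("MCAR", "MNAR"):
    print(kind, evaluate(kind, rate=0.2))
```

Under these assumptions, minimum value imputation should score better than mean imputation on the left-truncated (MNAR) data and worse on the MCAR data. This is in line with the abstract's point that the type of missingness determines which method is suitable.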

Original language: English
Article number: 492
Journal: BMC Bioinformatics
Volume: 20
Issue number: 1
DOI: 10.1186/s12859-019-3110-0
Publication status: Published - 11 Oct 2019
MoE publication type: A1 Journal article-refereed

Keywords

  • High dimensional data
  • Imputation
  • MAR
  • MCAR
  • Metabolomics
  • Missing values
  • MNAR
  • RF

Cite this

Kokla, Marietta ; Virtanen, Jyrki ; Kolehmainen, Marjukka ; Paananen, Jussi ; Hanhineva, Kati. / Random forest-based imputation outperforms other methods for imputing LC-MS metabolomics data : A comparative study. In: BMC Bioinformatics. 2019 ; Vol. 20, No. 1.
@article{dcc4dbb01d0841c7bb18b8ab1c375a03,
title = "Random forest-based imputation outperforms other methods for imputing LC-MS metabolomics data: A comparative study",
keywords = "High dimensional data, Imputation, MAR, MCAR, Metabolomics, Missing values, MNAR, RF",
author = "Marietta Kokla and Jyrki Virtanen and Marjukka Kolehmainen and Jussi Paananen and Kati Hanhineva",
year = "2019",
month = "10",
day = "11",
doi = "10.1186/s12859-019-3110-0",
language = "English",
volume = "20",
journal = "BMC Bioinformatics",
issn = "1471-2105",
number = "1",

}

TY - JOUR

T1 - Random forest-based imputation outperforms other methods for imputing LC-MS metabolomics data

T2 - A comparative study

AU - Kokla, Marietta

AU - Virtanen, Jyrki

AU - Kolehmainen, Marjukka

AU - Paananen, Jussi

AU - Hanhineva, Kati

PY - 2019/10/11

Y1 - 2019/10/11

KW - High dimensional data

KW - Imputation

KW - MAR

KW - MCAR

KW - Metabolomics

KW - Missing values

KW - MNAR

KW - RF

UR - http://www.scopus.com/inward/record.url?scp=85073099100&partnerID=8YFLogxK

U2 - 10.1186/s12859-019-3110-0

DO - 10.1186/s12859-019-3110-0

M3 - Article

C2 - 31601178

AN - SCOPUS:85073099100

VL - 20

JO - BMC Bioinformatics

JF - BMC Bioinformatics

SN - 1471-2105

IS - 1

M1 - 492

ER -