Machine learning models in electronic health records can outperform conventional survival models for predicting patient mortality in coronary artery disease
Andrew J. Steele, S. Aylin Cakiroglu, Anoop D. Shah, Nicholas M. Luscombe
Posted 30 Jan 2018
bioRxiv DOI: 10.1101/256008 (published DOI: 10.1371/journal.pone.0202344)
Prognostic modelling is important in clinical practice and epidemiology for patient management and research. Electronic health records (EHR) provide large quantities of data for such models, but conventional epidemiological approaches require significant researcher time to implement. Expert selection of variables, fine-tuning of variable transformations and interactions, and imputation of missing values are time-consuming and could bias subsequent analysis, particularly given that missingness in EHR is high and may itself carry meaning. Using a cohort of over 80,000 patients from the CALIBER programme, we performed a systematic comparison of several machine-learning approaches in EHR. We used Cox models and random survival forests, with and without imputation, on 27 expert-selected variables to predict all-cause mortality. We also used Cox models, random forests and elastic net regression on an extended dataset with 586 variables to build prognostic models and identify novel prognostic factors without prior expert input. We observed that data-driven models used on an extended dataset can outperform conventional models for prognosis, without data preprocessing or imputing missing values, and with no need to scale or transform continuous data. An elastic net Cox regression based on 586 unimputed variables, with continuous values discretised, achieved a C-index of 0.801 (bootstrapped 95% CI 0.799 to 0.802), compared to 0.793 (0.791 to 0.794) for a traditional Cox model comprising 27 expert-selected variables with imputation for missing values. We also found that data-driven models allow identification of novel prognostic variables; that the absence of values for particular variables carries meaning and can have significant implications for prognosis; and that variables often have a nonlinear association with mortality, which discretised Cox models and random forests can elucidate.
This demonstrates that machine-learning approaches applied to raw EHR data can be used to build reliable models for use in research and clinical practice, and identify novel predictive variables and their effects to inform future research.
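Two ideas in the abstract can be illustrated with a short sketch: discretising a continuous variable so that a missing value becomes its own category instead of being imputed, and scoring a model with the C-index. The code below is a minimal illustration and not the authors' implementation. The function names `discretise` and `concordance_index`, the use of empirical quantile cut points, and the default of four bins are all assumptions made for this example; the C-index follows the usual Harrell definition, ignoring pairs with tied event times.

```python
import bisect
from typing import List, Optional, Sequence


def discretise(values: Sequence[Optional[float]], n_bins: int = 4) -> List[str]:
    """Bin a continuous variable by quantiles, keeping 'missing' as a category.

    Hypothetical helper: cut points are taken from the observed values only,
    so no imputation is performed.
    """
    observed = sorted(v for v in values if v is not None)
    if not observed:
        return ["missing"] * len(values)
    # Cut points at the empirical quantiles of the observed values.
    cuts = [
        observed[min(len(observed) - 1, (k * len(observed)) // n_bins)]
        for k in range(1, n_bins)
    ]
    labels = []
    for v in values:
        if v is None:
            labels.append("missing")
        else:
            labels.append(f"bin{bisect.bisect_right(cuts, v)}")
    return labels


def concordance_index(
    times: Sequence[float], events: Sequence[bool], risks: Sequence[float]
) -> float:
    """Harrell's C-index: fraction of comparable pairs ranked correctly.

    A pair (i, j) is comparable when patient i had an observed event and a
    strictly shorter follow-up time than patient j. The pair is concordant
    when the model gave i the higher risk; tied risks count as half.
    """
    concordant = 0.0
    comparable = 0
    for i in range(len(times)):
        if not events[i]:
            continue  # censored patients cannot anchor a comparable pair
        for j in range(len(times)):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


if __name__ == "__main__":
    # Toy data, invented purely for illustration.
    print(discretise([1.0, 2.0, None, 3.0, 4.0], n_bins=2))
    print(concordance_index([1, 2, 3, 4], [True, True, False, True],
                            [0.9, 0.7, 0.4, 0.1]))
```

The resulting category labels could then be one-hot encoded and passed to a penalised Cox model, so that the "missing" level receives its own coefficient. A C-index of 0.5 corresponds to random ranking and 1.0 to perfect ranking.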
- Downloaded 2,851 times
- Download rankings, all-time:
  - Site-wide: 4,133
  - In epidemiology: 383
- Year to date:
  - Site-wide: 61,609
- Since beginning of last month:
  - Site-wide: None