Machine Learning Prediction of Incidence of Alzheimer’s Disease Using Large-Scale Administrative Health Data
Ji Hwan Park,
Han Eol Cho,
Jong Hun Kim,
Hyoung Seop Kim,
Posted 02 May 2019
bioRxiv DOI: 10.1101/625582 (published DOI: 10.1038/s41746-020-0256-0)
Nationwide population-based cohorts provide a new opportunity to build fully automated risk prediction models based on individuals’ histories of health and healthcare, going beyond existing risk prediction models. We tested whether machine learning models can predict the future incidence of Alzheimer’s disease (AD) using large-scale administrative health data. From the Korean National Health Insurance Service database between 2002 and 2010, we obtained de-identified health data on adults older than 65 years (N=40,736) containing 4,894 unique clinical features, including ICD-10 codes, medication codes, laboratory values, history of personal and family illness, and socio-demographics. To define incident AD, we considered two operational definitions: “definite AD”, with diagnostic codes and dementia medication (n=614), and “probable AD”, with diagnosis only (n=2,026). We trained and validated random forest, support vector machine, and logistic regression models to predict incident AD in the 1, 2, 3, and 4 subsequent years. In balanced samples (bootstrapping), the machine learning models showed reasonable performance in 1-year prediction, with AUCs of 0.775 and 0.759 for the “definite AD” and “probable AD” outcomes, respectively; in 2-year prediction, 0.730 and 0.693; in 3-year prediction, 0.677 and 0.644; and in 4-year prediction, 0.725 and 0.683. The results were similar when the entire (unbalanced) samples were used. Important clinical features selected in logistic regression included hemoglobin level, age, and urine protein level. This study may shed light on the utility of data-driven machine learning models based on large-scale administrative health data for AD risk prediction, which may enable better selection of individuals at risk for AD in clinical trials or early detection in clinical settings.
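The abstract reports AUC on balanced samples without spelling out the procedure. The sketch below illustrates the two ideas in plain Python, under stated assumptions: balancing is done by randomly undersampling the larger class (the abstract says only "bootstrapping", so this is an assumption, not the authors' method), and AUC is computed as the probability that a randomly chosen case receives a higher score than a randomly chosen control. The function names `balanced_indices` and `auc` are hypothetical, and the data are toy values rather than study data.

```python
import random


def balanced_indices(labels, seed=0):
    """Return shuffled indices holding equal numbers of cases (1) and controls (0).

    Assumption: balance is reached by randomly undersampling the larger class.
    """
    rng = random.Random(seed)
    cases = [i for i, y in enumerate(labels) if y == 1]
    controls = [i for i, y in enumerate(labels) if y == 0]
    n = min(len(cases), len(controls))
    picked = rng.sample(cases, n) + rng.sample(controls, n)
    rng.shuffle(picked)
    return picked


def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (ties count as 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one case and one control")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Toy labels: 3 cases among 12 individuals (not study data).
    toy_labels = [1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0]
    idx = balanced_indices(toy_labels)
    print(len(idx), sum(toy_labels[i] for i in idx))

    # Toy risk scores for two cases and two controls.
    print(auc([0.9, 0.8, 0.3, 0.2], [1, 0, 1, 0]))
```

The pairwise AUC is quadratic in sample size, which is fine for illustration; for a cohort of the size described here, a rank-based implementation from an established library would be the practical choice.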