
Rxivist combines preprints from bioRxiv with data from Twitter to help you find the papers being discussed in your field. Currently indexing 67,594 bioRxiv papers from 298,187 authors.

Electronic Health Records Based Prediction of Future Incidence of Alzheimer's Disease Using Machine Learning

By Ji Hwan Park, Han Eol Cho, Jong Hun Kim, Melanie Wall, Yaakov Stern, Hyunsun Lim, Shinjae Yoo, Hyoung-Seop Kim, Jiook Cha

Posted 02 May 2019
bioRxiv DOI: 10.1101/625582

Background: Accurate prediction of the future incidence of Alzheimer's disease (AD) may facilitate intervention strategies to delay disease onset. Existing AD risk prediction models require collection of biospecimens (genetic, CSF, or blood samples), cognitive testing, or brain imaging. In contrast, electronic health records (EHR) provide an opportunity to build a completely automated risk prediction model based on an individual's history of health and healthcare. We tested machine learning models to predict future incidence of AD using administrative EHR in individuals aged 65 or older.

Methods: We obtained de-identified EHR from Korean elders above 65 years of age (N=40,736), collected between 2002 and 2010 in the Korean National Health Insurance Service database system. Consisting of the Participant Insurance Eligibility database, the Healthcare Utilization database, and the Health Screening database, our EHR contain 4,894 unique clinical features, including ICD-10 codes, medication codes, laboratory values, history of personal and family illness, and socio-demographics. Our event of interest was new incidence of AD, defined from the EHR based on both AD codes and prescription of anti-dementia medication. Two definitions were considered: a more stringent one requiring a diagnosis and dementia medication, resulting in n=614 cases (definite AD), and a more liberal one requiring only diagnostic codes (n=2,026; probable AD). We trained and validated random forest, support vector machine, and logistic regression models to predict incident AD in the 1, 2, 3, and 4 subsequent years using the EHR available since 2002. The length of the EHR used in the models ranged from 1,571 to 2,239 days. Model training, validation, and testing were done using iterative (5 times), nested, stratified 5-fold cross-validation.

Results: The average duration of EHR was 1,936 days in AD and 2,694 days in controls. For predicting future incidence of AD using the definite AD outcome, the machine learning models showed the best performance in 1-year prediction, with an AUC of 0.781; in 2-year, 0.739; in 3-year, 0.686; in 4-year, 0.662. Using the probable AD outcome, the machine learning models showed the best performance in 1-year prediction, with an AUC of 0.730; in 2-year, 0.645; in 3-year, 0.575; in 4-year, 0.602. Important clinical features selected in logistic regression included hemoglobin level (b=-0.902), age (b=0.689), urine protein level (b=0.303), prescription of Lodopin (antipsychotic drug) (b=0.303), and prescription of Nicametate Citrate (vasodilator) (b=-0.297).

Conclusion: This study demonstrates that EHR can detect risk for incident AD. This approach could enable risk-specific stratification of elders for better targeted clinical trials.
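
The evaluation scheme named in the Methods (stratified 5-fold cross-validation repeated 5 times, scored by AUC) can be illustrated with a minimal sketch. This is not the authors' code: it runs on a small synthetic cohort with a single made-up feature, uses a toy scorer in place of the random forest, support vector machine, and logistic regression models, and omits the inner tuning loop of nested cross-validation. All function names (`stratified_kfold`, `roc_auc`, `fit_direction`) are hypothetical and chosen only for this illustration.

```python
import random
from collections import defaultdict
from statistics import mean


def stratified_kfold(labels, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs with a similar class ratio in every fold."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    folds = [[] for _ in range(k)]
    for idxs in by_class.values():
        rng.shuffle(idxs)
        # Deal each class out round-robin so every fold gets a near-equal share.
        for pos, i in enumerate(idxs):
            folds[pos % k].append(i)
    for f in range(k):
        test = sorted(folds[f])
        train = sorted(i for g in range(k) if g != f for i in folds[g])
        yield train, test


def roc_auc(labels, scores):
    """AUC as the fraction of (case, control) pairs ranked correctly; ties count 0.5."""
    pos = [s for lab, s in zip(labels, scores) if lab == 1]
    neg = [s for lab, s in zip(labels, scores) if lab == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one case and one control")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def fit_direction(x, y, train):
    """Toy 'model' (assumption): learn whether cases score higher or lower on x."""
    case_mean = mean(x[i] for i in train if y[i] == 1)
    ctrl_mean = mean(x[i] for i in train if y[i] == 0)
    return 1.0 if case_mean >= ctrl_mean else -1.0


# Synthetic cohort (assumption): 20 cases and 200 controls, one Gaussian feature.
rng = random.Random(42)
y = [1] * 20 + [0] * 200
x = [rng.gauss(2.0 if label == 1 else 0.0, 1.0) for label in y]

fold_aucs = []
for repeat in range(5):  # 5 repetitions, each with a different shuffle
    for train, test in stratified_kfold(y, k=5, seed=repeat):
        sign = fit_direction(x, y, train)  # fit on the training folds only
        test_scores = [sign * x[i] for i in test]
        fold_aucs.append(roc_auc([y[i] for i in test], test_scores))

mean_auc = mean(fold_aucs)
print(f"mean AUC over {len(fold_aucs)} folds: {mean_auc:.3f}")
```

Stratifying matters for an outcome this rare (614 definite AD cases among 40,736 individuals), because unstratified folds could end up with very few cases and an unstable AUC estimate. In a fully nested design, each training split would itself be split again to choose hyperparameters before scoring the held-out fold.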

Download data

  • Downloaded 188 times
  • Download rankings, all-time:
    • Site-wide: 50,428 out of 67,591
    • In bioinformatics: 5,527 out of 6,655
  • Year to date:
    • Site-wide: 27,987 out of 67,591
  • Since beginning of last month:
    • Site-wide: 21,094 out of 67,591




