A comparative feature selection study: Predicting Alzheimer's disease using primary healthcare and social services data

Research output: Contribution to journal › Article › Scientific › peer-review

Abstract

This study investigates the use of different feature selection techniques to improve the performance of machine learning (ML) models for the early prediction of Alzheimer's disease (AD), using primary healthcare and social services data from a cohort of 26,828 residents aged 65 years and older in Kuopio, Finland. We compared pre-classifier feature selection approaches, such as analysis of variance (ANOVA) and mutual information (MI), with post-classifier approaches, such as SHapley Additive exPlanations (SHAP). We assessed six ML models and found that feature selection improved performance over using all features: XGBoost achieved the highest AUC (0.755) and Logistic Regression the highest balanced accuracy (0.668) using 50 SHAP-selected features, 3–4 years before clinical confirmation of the disease. The most predictive features originated from primary healthcare, particularly ICPC and ICD-10 codes for dementia and mild cognitive impairment. The results underscore the importance of feature selection for improving both performance and interpretability in early AD prediction, and highlight the need to tailor feature selection to the ML model and dataset characteristics. The contribution of our work lies in integrating primary healthcare data with social services data for AD prediction, a combination not explored in prior studies. Moreover, while most studies relied on a single feature selection approach, we compare several approaches to identify the most effective methods for capturing AD risk factors. Future work should address the limitations of this study, including parameter optimization, data imbalance, small AD sample sizes, and the single geographic cohort, and should consider additional features such as imaging biomarkers to enhance prediction.
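To make the pre-classifier (filter) idea concrete, the following is a minimal sketch of ANOVA-based feature ranking together with the balanced accuracy metric reported above. It is not the authors' pipeline: the function names (`anova_f_score`, `select_top_k`, `balanced_accuracy`) and the toy data are hypothetical, chosen only for illustration, and the code uses the Python standard library alone. In practice, scikit-learn's `f_classif` with `SelectKBest` offers an equivalent ANOVA filter.

```python
from statistics import mean


def anova_f_score(values, labels):
    """One-way ANOVA F-statistic of a single feature across the classes in `labels`."""
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(y, []).append(v)
    n, k = len(values), len(groups)
    grand_mean = mean(values)
    # Between-group variability: distance of each class mean from the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups.values())
    # Within-group variability: spread of samples around their own class mean.
    ss_within = sum((v - mean(g)) ** 2 for g in groups.values() for v in g)
    if ss_within == 0:
        # Guard for a feature that is constant within every class.
        return float("inf") if ss_between > 0 else 0.0
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def select_top_k(X, y, k):
    """Rank features by F-score and return the indices of the top k, plus all scores."""
    n_features = len(X[0])
    scores = [anova_f_score([row[j] for row in X], y) for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return ranked[:k], scores


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, which is less misleading than accuracy on imbalanced data."""
    recalls = []
    for c in set(y_true):
        idx = [i for i, t in enumerate(y_true) if t == c]
        recalls.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return mean(recalls)


# Toy, made-up data (assumption): 6 samples, 3 binary features; label 1 is the rarer class.
X = [[1, 0, 1], [1, 1, 0], [1, 0, 1], [0, 1, 0], [0, 0, 1], [0, 1, 0]]
y = [1, 1, 0, 0, 0, 0]

selected, scores = select_top_k(X, y, k=1)
print("selected feature indices:", selected)
print("F-scores:", scores)
print("balanced accuracy:", balanced_accuracy(y, [1, 0, 0, 0, 0, 0]))
```

A filter of this kind scores each feature independently of any classifier, which is what distinguishes it from post-classifier approaches such as SHAP, where importance is derived from a fitted model.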
Original language: English
Article number: 101703
Journal: Informatics in Medicine Unlocked
Volume: 59
Publication status: Published - Oct 2025
MoE publication type: A1 Journal article-refereed

Funding

This project was partly funded by The European Regional Development Fund (ERDF) through The North Savo Regional Council, Finland (EURA 2014/8894/09 02 01 01/2019/PSL) and VTT internal funding.

Keywords

  • Alzheimer's disease
  • Electronic health record
  • Feature selection
  • Machine learning
  • Primary healthcare
