Development and Validation of Machine-Learning Algorithms to Predict the Onset of Depression Using Electronic Health Record Data: A Prognostic Modeling Study.

Chen, Frances R, James L Huang, Debbie L Wilson, and Wei-Hsuan Jenny Lo-Ciganic. 2025. “Development and Validation of Machine-Learning Algorithms to Predict the Onset of Depression Using Electronic Health Record Data: A Prognostic Modeling Study.” Studies in Health Technology and Informatics 329: 997-1001.

Abstract

INTRODUCTION: Early detection and intervention are crucial for reducing the impacts of depression and associated healthcare costs. Few studies have used electronic health records (EHR) and machine learning (ML) with a longitudinal design to predict depression onset. We developed and validated ML algorithms using EHR to identify patients at high risk for the onset of diagnosis-based major depressive disorder (MDD) in primary care settings.

METHODS: Using a prognostic modeling approach with a retrospective cohort study design, we identified patient visits in primary care settings for individuals aged ≥18 years from the Accelerating Data Value Across a National Community Health Center Network Clinical Research Network 2015-2021 data. We measured 267 features at six-month intervals starting six months prior to the first encounter. We developed algorithms using Least Absolute Shrinkage and Selection Operator (LASSO), random forest, and XGBoost with 10-fold cross-validation. Using hold-out testing data, we measured prediction performance (e.g., C-statistics), stratified patients into decile risk subgroups, and assessed model biases.
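The decile risk stratification mentioned above can be illustrated with a minimal sketch. It is not the authors' code: the function name `decile_capture`, the synthetic scores, and the simulated 1% outcome rate are all assumptions made for illustration. The sketch ranks individuals by predicted risk, splits them into ten equal-sized groups, and reports the share of all positive cases that falls in each group.

```python
import random


def decile_capture(scores, labels):
    """Share of all positive cases in each risk decile.

    Index 0 is the highest-risk decile, index 9 the lowest.
    `decile_capture` is a hypothetical helper, not part of the study.
    """
    # Rank individuals from highest to lowest predicted risk.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    n = len(order)
    total_pos = sum(labels)
    shares = []
    for d in range(10):
        start = d * n // 10
        stop = (d + 1) * n // 10
        pos = sum(labels[i] for i in order[start:stop])
        shares.append(pos / total_pos if total_pos else 0.0)
    return shares


# Synthetic data only (assumption): about 1% positives, and scores that are
# moderately higher on average for positives than for negatives.
rng = random.Random(0)
labels = [1 if rng.random() < 0.01 else 0 for _ in range(20000)]
scores = [rng.gauss(1.0 if y else 0.0, 1.0) for y in labels]

shares = decile_capture(scores, labels)
top3 = sum(shares[:3])
print(f"Share of positives in the top three deciles: {top3:.2%}")
```

Summing the first three entries gives the kind of "top three deciles" capture figure reported in the results; the value printed here reflects the synthetic data, not the study.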

RESULTS: Among 1,965,399 eligible individuals (mean age = 43.52 ± 16.04 years; male = 35%; African American = 20%) with 4,985,280 person-periods, the MDD onset rate was 1% during the study period. XGBoost performed similarly to the other models while using the fewest predictors (C-statistic = 0.763, 95% CI = [0.760, 0.767]). XGBoost had a 66.78% sensitivity, 74.19% specificity, and 2.55% positive predictive value at the balanced threshold identified using the Youden Index. The top three risk decile subgroups captured ∼70% of MDD cases, without significant racial or sex biases.
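The threshold and positive predictive value figures can be related with a short sketch. It is an illustration rather than the authors' implementation: `youden_threshold` and `ppv` are hypothetical helper names, and the 1% prevalence passed to `ppv` is an assumption that approximates the reported onset rate. The Youden Index picks the cutoff that maximizes J = sensitivity + specificity - 1, and Bayes' rule shows why the positive predictive value stays low when the outcome is rare.

```python
def youden_threshold(scores, labels):
    """Return (threshold, J) maximizing J = sensitivity + specificity - 1.

    Individuals with score >= threshold are flagged as high risk.
    Brute-force search over observed scores; fine for small examples.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best_t, best_j = None, -1.0
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        tn = sum(1 for s, y in zip(scores, labels) if s < t and y == 0)
        j = tp / n_pos + tn / n_neg - 1
        if j > best_j:
            best_t, best_j = t, j
    return best_t, best_j


def ppv(sensitivity, specificity, prevalence):
    """Positive predictive value from Bayes' rule."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)


# Toy scores (assumption) to show the threshold search.
t, j = youden_threshold([0.1, 0.2, 0.8, 0.9], [0, 0, 1, 1])

# Reported sensitivity and specificity, with an assumed 1% prevalence.
reported_j = 0.6678 + 0.7419 - 1
approx_ppv = ppv(0.6678, 0.7419, 0.01)
print(f"J = {reported_j:.4f}, PPV = {approx_ppv:.2%}")
```

Under the assumed 1% prevalence, the computed positive predictive value lands close to the reported 2.55%, which suggests the low value follows mainly from the rarity of MDD onset rather than from poor discrimination.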

CONCLUSIONS: An ML algorithm using EHR data can effectively identify individuals at high risk of depression onset within the subsequent six months, without exacerbating racial or sex biases, providing a valuable tool for targeted early interventions.

Last updated on 08/08/2025