The more the merrier – application of machine learning processes on social determinants of health data to ameliorate the effects of chronic disease on individuals and communities






This dissertation applies machine learning techniques to predict the costs and prevalence of chronic diseases and to identify novel features associated with those diseases. Our focus is on diabetes and mental health disorders: chronic, debilitating diseases that reduce length and quality of life and incur costs at every economic level, from micro to macro. Machine learning techniques are efficient and accurate on large data sets, offer many attractive modeling and feature selection options, and have a well-established record in medical applications. Chronic diseases have broad, complex etiologies that depend on genetics, environment, and often the interaction between the two. The chronic diseases this dissertation focuses on are diabetes (all forms) and debilitating mental illness (a depression, anxiety, or emotional problem; depression; and poor mental health status). The choice of these diseases, while personal, ultimately rested on their high societal and economic burden. Diabetes and poor mental health status reduce long-run economic growth and will continue to do so until effective tactics and policy address prevention or at least delay progression to late-stage disease. We examine these diseases using machine learning techniques at the individual, geographic, and chronological levels. Chapter 2, "How sweet is machine learning," uses R’s caret package to A) identify the geographic determinants associated with diabetes prevalence and changes in prevalence at the census tract level and B) predict future diabetic hospitalization costs at the county level. Chapter 3, "Time marches on," uses the PyCaret machine learning library in a Jupyter notebook to A) identify the traits associated with an individual’s status of having activity limitations stemming directly from depression, anxiety, or emotional problems and B) assess whether these traits change over time, specifically from 1997 to 2018.
Chapter 4, "Spike Protein Depression," uses PyCaret to A) examine the prevalence of depression and poor mental health at the census tract level for 2019, B) identify features associated with changes in tract-level depression prevalence between 2019 and 2021, and C) forecast future county-level depression-associated hospitalization costs. The dissertation starts with highly supervised machine learning at the geographic level, then moves to a more streamlined and less supervised process to examine debilitating mental illness over time (1997 to 2018) and geographic depression prevalence before and after the COVID-19 pandemic. Overall, the geographic diabetes and depression models performed well, while the chronological models of severe mental health, measured at the individual level, were less precise. All features identified as important via Shapley values were well supported by the literature and were useful for analysis. The modeling process was efficient, especially considering the dimensionality of the data.



chronic disease, machine learning, GIS, health economics, diabetes