An investigation of machine learning algorithms and data augmentation techniques for diabetes diagnosis using class imbalanced BRFSS dataset

Abstract

Diabetes is a prevalent chronic condition that poses significant challenges to early diagnosis and identifying at-risk individuals. Machine learning plays a crucial role in diabetes detection by leveraging its ability to process large volumes of data and identify complex patterns. However, imbalanced data, where the number of diabetic cases is substantially smaller than non-diabetic cases, complicates the identification of individuals with diabetes using machine learning algorithms. This study focuses on predicting whether a person is at risk of diabetes, considering the individual's health and socio-economic conditions while mitigating the challenges posed by imbalanced data. We employ several data augmentation techniques, such as oversampling (Synthetic Minority Over Sampling for Nominal Data, i.e.SMOTE-N), undersampling (Edited Nearest Neighbor, i.e. ENN), and hybrid sampling techniques (SMOTE-Tomek and SMOTE-ENN) on training data before applying machine learning algorithms to minimize the impact of imbalanced data. Our study sheds light on the significance of carefully utilizing data augmentation techniques without any data leakage to enhance the effectiveness of machine learning algorithms. Moreover, it offers a complete machine learning structure for healthcare practitioners, from data obtaining to machine learning prediction, enabling them to make informed decisions.

Description

© 2023 The Author(s) cc-by-nc-nd

Keywords

Behavioral Risk Factor Surveillance System (BRFSS), Classification, Diabetes, Imbalanced data, Machine learning, Nominal data, Sampling

Citation

Chowdhury, M.M., Ayon, R.S., & Hossain, M.S.. 2024. An investigation of machine learning algorithms and data augmentation techniques for diabetes diagnosis using class imbalanced BRFSS dataset. Healthcare Analytics, 5. https://doi.org/10.1016/j.health.2023.100297

Collections