Pattern Recognition and Analysis in Imbalanced Sequential Data
Abstract
In machine learning, deep learning, and large language models, trained models are commonly assumed to learn representative class distributions from the training data. In real-world scenarios, however, data are unevenly distributed and highly skewed in application domains such as finance, health, language, and cybersecurity. This class imbalance is a daunting challenge that calls for robust machine learning models.
The class imbalance problem has emerged as a priority research area, with methods developed so that trained models perform better on minority classes. For NLP, techniques to address the problem include under- and over-sampling, class weights, data augmentation, and regularization (i.e., penalizing the model during training). For time series, techniques that remove trends and seasonality, weight abnormal data through scoring metrics, and apply semi-supervised learning have been employed. However, uncertainty remains a challenge across these contexts. Ensemble techniques that combine resampling with modified algorithmic approaches have shown promising results, although they introduce additional complexity.
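As a concrete illustration of the data-level and algorithmic-level remedies mentioned above, the following minimal Python sketch (not taken from the dissertation; the dataset and values are purely illustrative) shows random oversampling of a minority class and inverse-frequency class weights computed with scikit-learn.

```python
# Minimal sketch of two common imbalance remedies: random oversampling at the
# data level and class weights at the algorithmic level. Data are synthetic.
import numpy as np
from sklearn.utils import resample
from sklearn.utils.class_weight import compute_class_weight

rng = np.random.default_rng(42)

# Toy imbalanced dataset: 950 majority (label 0) vs. 50 minority (label 1) samples.
X = rng.normal(size=(1000, 8))
y = np.array([0] * 950 + [1] * 50)

# Data level: randomly oversample the minority class up to the majority count.
X_maj, y_maj = X[y == 0], y[y == 0]
X_min, y_min = X[y == 1], y[y == 1]
X_min_up, y_min_up = resample(
    X_min, y_min, replace=True, n_samples=len(y_maj), random_state=42
)
X_bal = np.vstack([X_maj, X_min_up])
y_bal = np.concatenate([y_maj, y_min_up])

# Algorithmic level: inverse-frequency class weights to pass to the loss function.
weights = compute_class_weight("balanced", classes=np.unique(y), y=y)
class_weight = dict(zip(np.unique(y), weights))
print(class_weight)  # roughly {0: 0.53, 1: 10.0}
```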
This dissertation advances the state of the art in handling class-imbalanced datasets, applying emerging deep learning techniques to anomaly detection in imbalanced IoT time series data. First, the approach employs deep learning models such as LSTM, BiLSTM, CuDNN-LSTM, and TCN with various hyperparameter tunings to model the anomaly detection problem. Second, when combined with data-level augmentation approaches such as oversampling and undersampling, the deep learning models treat textual data as sequences, making them amenable to numerical time series analysis. Lastly, this dissertation examines class imbalance in textual data using Large Language Model (LLM)-based data augmentation with controlled prompts, and assesses the relative merits of existing sampling techniques, regularization, and class weights at the algorithmic level. We adapt LLM-based data augmentation alongside existing techniques at the data level and at the algorithmic level (i.e., regularization and class weights) for a comparative study. Our experiments demonstrate that LLM-based data augmentation has the potential to address the challenges posed by imbalanced datasets in real-world scenarios.
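To make the LLM-based augmentation step more concrete, here is a minimal, hedged sketch assuming a controlled prompt template that asks an LLM to paraphrase minority-class examples. The prompt wording, labels, and the `call_llm` function are hypothetical placeholders, not APIs or prompts from the dissertation.

```python
# Hypothetical sketch of LLM-based data augmentation with a controlled prompt.
# `call_llm` stands in for whatever LLM endpoint is actually used.
from typing import Callable, List, Tuple

PROMPT_TEMPLATE = (
    "You are generating training data for a text classifier.\n"
    "Rewrite the following '{label}' example as {n} new sentences that keep the "
    "same meaning and label but vary wording and structure.\n"
    "Return one sentence per line.\n\n"
    "Example: {text}"
)

def augment_minority(
    examples: List[Tuple[str, str]],   # (text, label) pairs from the minority class
    n_per_example: int,
    call_llm: Callable[[str], str],    # placeholder: prompt -> raw model output
) -> List[Tuple[str, str]]:
    """Generate synthetic minority-class samples via controlled prompts."""
    synthetic = []
    for text, label in examples:
        prompt = PROMPT_TEMPLATE.format(label=label, n=n_per_example, text=text)
        output = call_llm(prompt)
        for line in output.splitlines():
            line = line.strip()
            if line:                           # keep each non-empty generation
                synthetic.append((line, label))
    return synthetic

# Usage (with a stub LLM so the sketch runs without any external service):
if __name__ == "__main__":
    stub = lambda prompt: "Paraphrase one.\nParaphrase two."
    minority = [("The device reported an unusual spike in traffic.", "anomaly")]
    print(augment_minority(minority, n_per_example=2, call_llm=stub))
```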
Embargo status: Restricted until 06/2027.