Big data stream analytics with AI techniques

Date

2019-12

Abstract

Today, data are continuously generated in large volumes and at high speed from a multitude of heterogeneous sources (e.g., Internet of Things, smart cities, social media, healthcare, and financial applications), creating unbounded and ever-growing Big Data Streams. Analytics of Big data streams must deal with time-constrained access and with unpredictable, evolving sequential data that are potentially inconsistent and dynamically distributed. Because of this, even tasks that are simple in a batch environment, such as preprocessing data, can be challenging in a stream environment. Most current analytics cope with Big data streams by assuming the data are already preprocessed and by keeping the preprocessing parameters fixed or manually adjusted to adapt machine learning models to changing stream data. But adapting the modeling alone, without properly preprocessing the data, has been shown to yield poor results. In fact, very few studies have been done on preprocessing of Big data streams, and those that exist tightly couple preprocessing with modeling or learning tasks (e.g., classification), which limits the use of the preprocessed data to specific purposes. Furthermore, most existing data preprocessing is not performed in parallel distributed environments, even though preprocessing is time consuming and critical to the results of analytics. This dissertation aims to enhance Big Data Stream analytics by alleviating the above issues of data preprocessing in two ways. First, by separating data preprocessing from data analytics so that it is performed independently, the preprocessed data can be used for general purposes. Second, by implementing preprocessing as online adaptive mechanisms in distributed environments, it can handle fast and dynamically varying data arrival rates.
To maximize the degree of autonomy of stream data preprocessing, our research exploits Artificial Intelligence (AI) techniques by using domain-specific knowledge and enabling the adaptability of stream data preprocessing. The dissertation presents three preprocessing approaches: one for textual and two for numerical data streams. The textual preprocessing approach investigates tweet analytics with hashtags, hyperlinked words popular for tweet retrieval. In particular, the proposed approach takes a Big data stream of tweets and identifies topics of interest from their hashtags. Although hashtags have been successfully used for sentiment analysis, using them to identify other, non-sentiment topics is difficult due to the lack of hashtag semantics. Unlike other work that uses natural language processing tools and techniques to annotate the meaning of hashtags, this dissertation builds a domain-specific knowledge base of topic concepts (referred to as an ontology) and uses it to create a set of strong hashtag predictors of tweet topics. The proposed hybrid approach combines hashtags extracted from the input tweet data with hashtags derived from the ontology by an automated hashtag generation tool. The approach has been implemented as an online method on Apache Storm (a real-time distributed framework) for distributed and scalable processing.
The two numerical preprocessing approaches focus on adaptive normalization, a common preprocessing task for numeric data streams. Both use a sliding window mechanism, which has been used for stream data analytics but not for stream data normalization. The first approach adapts the normalization to the data values using statistics computed over each sliding window of a fixed size; this is suitable for changing data streams that have steady rates. The second adapts the size of the sliding window according to the data rates, ensuring that the preprocessing keeps up with high-speed data and provides up-to-date data for real-time applications.
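The fixed-window variant of adaptive normalization can be sketched as follows. This is a minimal illustration assuming z-score normalization over a fixed-size sliding window; the class and parameter names are hypothetical and the dissertation's actual mechanism may differ:

```python
from collections import deque
import math

class SlidingWindowNormalizer:
    """Illustrative sketch: z-score normalization of a numeric stream,
    using statistics from a fixed-size sliding window of recent values."""

    def __init__(self, window_size=100):
        # deque with maxlen drops the oldest value once the window is full
        self.window = deque(maxlen=window_size)

    def update(self, value):
        """Add a new stream value; return it normalized by window statistics."""
        self.window.append(value)
        n = len(self.window)
        mean = sum(self.window) / n
        var = sum((x - mean) ** 2 for x in self.window) / n
        std = math.sqrt(var)
        if std == 0:
            return 0.0  # no variation yet in the window
        return (value - mean) / std
```

Because the window slides with the stream, the mean and standard deviation adapt to recent data, which is what lets the normalization track a changing distribution; the rate-adaptive variant would additionally resize `window_size` as arrival rates change.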
Evaluating these approaches requires experiments using both real-world and synthetic data. The approaches, along with the experimental results, will be described in detail.

Keywords

Big data stream, Apache Storm, Preprocessing, Artificial intelligence
