Scaling MapReduce-based big data analytics algorithms with AI inspired controls
Date
Authors
Journal Title
Journal ISSN
Volume Title
Publisher
Abstract
A massive volume of Big data increases complexity. Even simple computing tasks can become problematic or too costly to be computationally realistic. Most data analytics, particularly machine learning algorithms, are initially designed for in-memory data on a single machine. Analyzing Big data with these algorithms becomes cumbersome and, in some cases, unachievable. A natural remedy is to employ parallel and distributed computing to scale these algorithms to "Big data analytics algorithms" in order to cope with Big data. However, this dissertation shows that parallelism alone is not necessarily enough. Many Big data analytics algorithms have been built in a popular MapReduce framework using MapReduce, a programming paradigm that enables parallel and distributed execution of massive data processing on large clusters of machines. Most existing MapReduce-based analytics algorithms are "naïve" in that they mimic process flows in sequential algorithms and focus on techniques for efficient parallelization by using appropriate keys or representations. While these are useful, little work has been done to gain understanding of what makes one MapReduce-based analytics algorithm more efficient than the other. What are the underlying characteristics? Do all "naïve" MapReduce-based analytics algorithms yield optimal performance and how can we design algorithms to fully take advantage of distributed and parallel environments? To answer these questions, this dissertation explores state-of-the-arts of families of MapReduce-based analytics algorithms to explicate fundamental properties that optimally exploit parallelism. MapReduce enables Map and Reduce tasks to be executed on different machine nodes in parallel. The nodes that have finished their tasks but have to wait for others to complete before they can start the next task can result in poor performance. In AI (Artificial Intelligence), an intelligent agent cannot always act as planned but to immediately respond to dynamically changing environments. This is facilitated by an opportunistic control, where runtime situations trigger execution of a subset of actions. Like AI control, MapReduce algorithms that execute opportunistically can minimize the wait and convocation delay and thus, improve performance. When a MapReduce-based algorithm executes completely with an opportunistic control, the algorithm behaves as a stateless algorithm. The dissertation further explores these concepts and proposes new paradigms as well as new algorithms that apply this concept paradigm. The contributions of this dissertation includes: 1) Studies that elucidate "statelessness" as a fundamental property of MapReduce-based Analytics algorithms to optimally exploit parallelism, 2) A new paradigm for MapReduce-based Analytics algorithms design/development that aims to maximize statelessness by AI inspired controls that employ opportunistic executions of Map and Reduce workers, 3) Two new MapReduce-based Apriori algorithms for Association Rule Mining: one is stateless with opportunistic execution and the other is its extension to handle limited memory in practice, 4) A new MapReduce-based Decision Tree Learning algorithms with opportunistic execution, and 5) Illustrations of applications in various domains including Health analytics. The dissertation presents details of these contributions along with experimental results. The proposed new algorithms in "stateless" paradigm significantly improve execution times over those of the "naïve" (or traditional) MapReduce-based approaches. For example, on 10% support threshold, the proposed MapReduce-based Apriori has an average reduction of the execution time of about 70%, over 50,000-200,000 transactions on one to three machines. The execution time of the proposed MapReduce-based Apriori algorithm is shown to be "better than linear" in the number of transactions.
Embargo status: Restricted until January 2026. To request access, click on the PDF link to the left.