Fast data analysis framework for scientific big data applications






Scientific breakthroughs are increasingly powered by advanced computing and data analysis capabilities. Data-driven scientific discovery has become the fourth paradigm of scientific innovation, after theory-, experiment-, and simulation-driven innovation. It builds on the advanced high performance computing (HPC) that traditionally powers simulation-driven research, but further requires processing massive datasets. Revealing and exploring the knowledge hidden inside scientific datasets poses critical challenges, and the problem is beyond the capability of traditional HPC software systems. Not only the existing data and computing models, but also the runtime and storage architectures of HPC systems, need to be revisited to meet these "big data" challenges. The fundamental issue is data movement, which often dominates the overall analysis performance and execution time. To optimize and speed up the discovery process, this dissertation research studies scientific workflows and designs a Fast Data Analysis Framework, a top-down high performance computing framework focused on reducing data movement and accelerating scientific big data processing. The framework includes a newly designed Statistical Data model with integrated statistics and subsetting schemes to speed up query analysis. It also includes an In-Advance Computing model, designed to better support generic scientific analysis routines. This computing model has a flexible two-level design: a coarse-grain level that performs optimizations at the analysis-operation level, and a fine-grain level that moves computations in advance to produce partial analysis results on incomplete I/O (input/output) streams. The two levels can work independently or coherently together, using in-advance computation to reduce the required data movement.
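The fine-grain level described above can be illustrated with a minimal sketch: each incomplete I/O chunk is reduced to a small partial result as soon as it arrives, and partials are merged with an associative operator, so the final answer needs only a cheap merge rather than a re-read of the data. The chunk interface and the particular statistics below are illustrative assumptions, not the dissertation's actual API.

```python
# Sketch of fine-grain in-advance computing (illustrative, not the
# framework's real interface): reduce each I/O chunk to partial
# statistics on arrival, then merge the partials.

def partial_stats(chunk):
    """Reduce one incomplete I/O chunk to a small partial result."""
    return {"sum": sum(chunk), "count": len(chunk),
            "min": min(chunk), "max": max(chunk)}

def merge_stats(a, b):
    """Combine two partials; associative, so arrival order is free."""
    return {"sum": a["sum"] + b["sum"], "count": a["count"] + b["count"],
            "min": min(a["min"], b["min"]), "max": max(a["max"], b["max"])}

def analyze_stream(chunks):
    """Fold partial results over the stream as chunks become available."""
    result = None
    for chunk in chunks:
        p = partial_stats(chunk)
        result = p if result is None else merge_stats(result, p)
    return result

stats = analyze_stream([[3.0, 1.0], [4.0], [1.0, 5.0]])
mean = stats["sum"] / stats["count"]   # derived after the merge
```

Because the merge is associative, partials can also be combined across processes, which is what lets the coarse-grain and fine-grain levels compose.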
The framework further supports Hierarchical Runtime Scheduling, which considers both storage-side I/O queues and client-side data redistribution, and Automatic Storage Reorganization, which maintains multiple data layouts and automatically redirects I/O to the better-suited layout. These designs provide novel algorithms that optimize data movement in both the runtime and the file system. The evaluation results confirm that the Fast Data Analysis Framework improves the performance of scientific big data applications. It can inform the design and development of future infrastructures, algorithms, and systems for the data-driven scientific discovery paradigm.
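The layout-redirection idea can be sketched as follows: the same dataset is kept in two layouts, and each read request is routed to the layout that serves the requested slice contiguously. The layout names and the simple shape-based selection heuristic are illustrative assumptions, not the framework's actual policy.

```python
# Sketch of automatic storage reorganization (illustrative): maintain
# two layouts of one dataset and redirect reads to the matching one.

data = [[0, 1, 2, 3],
        [4, 5, 6, 7],
        [8, 9, 10, 11]]

layouts = {
    "row_major": data,                               # fast full-row reads
    "col_major": [list(col) for col in zip(*data)],  # fast full-column reads
}

def redirect_read(rows, cols):
    """Redirect to the layout matching the slice shape (tall -> col_major)."""
    if len(cols) < len(rows):                 # column-shaped request
        name = "col_major"
        block = [[layouts[name][c][r] for c in cols] for r in rows]
    else:                                     # row-shaped request
        name = "row_major"
        block = [[layouts[name][r][c] for c in cols] for r in rows]
    return name, block

# A tall, narrow request is served from the column-major copy.
name, block = redirect_read(rows=[0, 1, 2], cols=[1])
```

In a real system the extra layouts are produced offline or opportunistically, and the redirection decision would weigh actual I/O costs rather than slice shape alone.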



Scientific Data Management, Parallel Computing, File System, In-Memory Processing, In-Advance Computing Model