Exploring in-memory HDF5 and early evaluations

Abstract

Many scientific big-data applications are iterative and can reuse results from previous stages of their workflow. HDF5 is a widely used library that provides scientists with a wide range of facilities for scientific data management and computation. Like other existing scientific I/O libraries, HDF5 follows an entirely disk-based model in which the results of each computation stage are always stored on disk. In our research, we propose placing these results in memory and reusing them for future requests, avoiding expensive disk accesses. Because the data resides in memory, it is not persistent. To provide persistence while still avoiding disk reads, lineage information, consisting of the source dataset and the computation that produced the current dataset, is kept in memory. This lineage is captured by intercepting the I/O call and recording it in a metadata structure as attributes of the dataset for each in-memory data block. The captured lineage metadata can then be used to re-compute a dataset without reading from disk. We evaluated our in-memory architecture with different I/O patterns: the contiguous I/O pattern scaled efficiently in a linear fashion, whereas the efficiency of the non-contiguous pattern remained unpredictable. In addition, we evaluated our lineage-tracking module against the traditional disk-based approach for reconstructing lost datasets.
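As a rough illustration of the kind of mechanism described above (not the authors' actual interception-based implementation), the sketch below keeps an HDF5 file in memory using the library's standard "core" file driver and records lineage as string attributes on a dataset. The attribute names (lineage_source, lineage_operation), the dataset names, and the choice of the core driver are illustrative assumptions.

/* Sketch only: in-memory HDF5 via the core driver, with hypothetical
 * lineage attributes attached to a derived dataset.
 * Build with: h5cc lineage_sketch.c -o lineage_sketch
 */
#include "hdf5.h"
#include <string.h>

/* Attach a scalar string attribute to an HDF5 object. */
static void set_string_attr(hid_t obj, const char *name, const char *value)
{
    hid_t space = H5Screate(H5S_SCALAR);
    hid_t type  = H5Tcopy(H5T_C_S1);
    H5Tset_size(type, strlen(value) + 1);
    hid_t attr  = H5Acreate2(obj, name, type, space, H5P_DEFAULT, H5P_DEFAULT);
    H5Awrite(attr, type, value);
    H5Aclose(attr); H5Tclose(type); H5Sclose(space);
}

int main(void)
{
    /* Core (in-memory) driver; backing_store = 0 means nothing is written to disk. */
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_core(fapl, 1 << 20 /* 1 MiB growth increments */, 0);
    hid_t file = H5Fcreate("inmem.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    /* A derived data block, produced here by a stand-in computation. */
    hsize_t dims[1] = {1024};
    hid_t space = H5Screate_simple(1, dims, NULL);
    hid_t dset  = H5Dcreate2(file, "derived_block", H5T_NATIVE_DOUBLE, space,
                             H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
    double data[1024];
    for (int i = 0; i < 1024; i++) data[i] = i * 2.0;
    H5Dwrite(dset, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL, H5P_DEFAULT, data);

    /* Lineage recorded as attributes of the in-memory dataset (names are illustrative):
     * enough information to re-compute the block if it is evicted. */
    set_string_attr(dset, "lineage_source", "/raw/input_block_0");
    set_string_attr(dset, "lineage_operation", "scale_by_2");

    H5Dclose(dset); H5Sclose(space); H5Fclose(file); H5Pclose(fapl);
    return 0;
}

A re-computation path would read these attributes back, locate the source dataset, and re-apply the recorded operation instead of fetching the derived block from disk.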

Keywords

In-memory, Lineage information, HDF5 library