Efficient scientific data discovery over self-describing file formats





Scientific experiments and observations store massive amounts of data in various scientific file formats. Among these, self-describing formats allow metadata to be stored alongside the data objects it describes. As a result, scientists commonly rely on self-describing file formats to sift through massive datasets and locate the data of interest. However, searching the metadata within these files efficiently remains challenging because of the sheer size of the data. Existing solutions extract the metadata from the files and store it in an external database management system (DBMS), which is then queried to locate the desired data in the files. This practice introduces significant overhead and complexity, hinders transparency and portability, and defeats the principle of self-describing data management. Therefore, an integrated metadata search solution that can be built into the software stack of self-describing file formats is necessary.

However, developing such a metadata search solution is challenging, as many problems remain unknown or unaddressed. First, many data structures can be used to index metadata and accelerate the search for target data, including hash tables, self-balancing search trees, skip lists, and sparse arrays. Selecting an efficient indexing data structure is difficult in the context of scientific data management because metadata, metadata queries, and indexing methodologies have not been sufficiently investigated. Second, an efficient metadata indexing solution is needed to facilitate metadata search. Such a solution must address the in-memory index design, the index persistence mechanism, and query processing. In addition, the metadata index must support intra-process parallelism, which is strongly desired for coping with data collected simultaneously from multiple sources during data generation. Finally, it is necessary to consider how metadata search can be performed efficiently under load-balance constraints in a distributed system.
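To make the data-structure trade-off concrete, the following minimal sketch contrasts a hash-based index, which suits exact-match queries, with an ordered index, which also supports range queries. It is an illustration only, not the index design evaluated in this dissertation. The class names, attribute values, and object paths are hypothetical, and a sorted Python list stands in for a self-balancing tree or skip list.

```python
import bisect
from collections import defaultdict


class HashIndex:
    """Exact-match index: attribute value -> list of object paths."""

    def __init__(self):
        self._buckets = defaultdict(list)

    def insert(self, value, object_path):
        self._buckets[value].append(object_path)

    def exact(self, value):
        # Average O(1) lookup, but no ordering, so no range queries.
        return list(self._buckets.get(value, []))


class SortedIndex:
    """Ordered index kept as two parallel lists sorted by attribute value.

    Assumption: a sorted list stands in for a balanced tree or skip list.
    Lookups are O(log n); inserts are O(n) here, unlike a real tree.
    """

    def __init__(self):
        self._keys = []
        self._paths = []

    def insert(self, value, object_path):
        pos = bisect.bisect_right(self._keys, value)
        self._keys.insert(pos, value)
        self._paths.insert(pos, object_path)

    def range(self, low, high):
        # Return paths whose value v satisfies low <= v <= high.
        start = bisect.bisect_left(self._keys, low)
        stop = bisect.bisect_right(self._keys, high)
        return self._paths[start:stop]


# Hypothetical metadata attributes attached to objects in a file.
by_type = HashIndex()
by_type.insert("galaxy", "/plate1/obj1")
by_type.insert("star", "/plate1/obj2")

by_redshift = SortedIndex()
by_redshift.insert(0.5, "/plate1/obj1")
by_redshift.insert(1.2, "/plate1/obj3")
by_redshift.insert(2.0, "/plate2/obj7")
```

In this sketch, `by_type.exact("galaxy")` answers an equality query on a string attribute, while `by_redshift.range(1.0, 2.0)` answers a range query on a numeric attribute, which the hash-based index cannot serve without scanning every bucket.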

To address these challenges, in this dissertation research, we first perform a systematic investigation of the essentials of metadata search in the context of scientific data management. We study real-world datasets from cosmological observations and explore the characteristics of their metadata. We investigate metadata queries and, based on these characteristics, evaluate different data structures for various types of metadata attributes. Our findings provide guidelines and insights for designing and developing innovative metadata management methodologies for scientific applications. We also propose Metadata Indexing and Querying for Self-describing formats (MIQS), a novel solution that removes the external DBMS and uses an in-memory index to achieve efficient metadata search. MIQS follows the self-contained data management paradigm and provides portable, schema-free metadata indexing and querying functionality for self-describing file formats. Compared with a MongoDB-based metadata search solution, MIQS achieved up to 99% time reduction in index construction and up to 172k× search performance improvement, with up to 75% reduction in memory footprint. Additionally, our design maintains multiple parallel compound indexing trees with a configurable parallelism setting and is highly scalable. Our experimental results show up to 4.9× indexing throughput improvement and up to 3.3× search throughput improvement. Finally, to address the challenge of distributed affix-based keyword search on HPC systems, we propose the Distributed Adaptive Radix Tree (DART). This trie-based approach scales to deliver efficient affix-based search while alleviating imbalanced keyword distribution and excessive requests on popular keywords at scale.
Our evaluation at different scales shows that, compared with the “full string hashing” use case of the most popular distributed indexing technique, the Distributed Hash Table (DHT), DART achieves up to 55× better throughput for prefix and suffix searches while achieving comparable throughput for exact and infix searches. Also, compared with the “initial hashing” use case of DHT, DART maintains a balanced keyword distribution across distributed nodes and alleviates excessive query workload against popular keywords.
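The tension between the two DHT use cases can be illustrated with a small sketch. It is neither DART nor the DHT implementation evaluated here. The node count, the keyword list, the use of CRC32 as the hash function, and all function names are assumptions made for illustration, and “initial hashing” is taken here to mean hashing only the first character of a keyword.

```python
import zlib
from collections import Counter

NUM_NODES = 4  # hypothetical number of index servers


def full_string_node(keyword):
    """Full string hashing: place a keyword by hashing the whole string."""
    return zlib.crc32(keyword.encode("utf-8")) % NUM_NODES


def initial_node(keyword):
    """Initial hashing (as assumed here): hash only the first character."""
    return zlib.crc32(keyword[:1].encode("utf-8")) % NUM_NODES


def nodes_for_prefix_query(prefix, scheme):
    """Return the set of nodes a prefix query has to contact."""
    if scheme == "initial":
        # Every keyword sharing the prefix shares its first character,
        # so all matches live on a single node.
        return {initial_node(prefix)}
    # With full string hashing, matching keywords may be on any node,
    # so the query has to be sent to all of them.
    return set(range(NUM_NODES))


def load_per_node(keywords, place):
    """Count how many keywords each node stores under a placement rule."""
    counts = Counter(place(k) for k in keywords)
    return [counts.get(n, 0) for n in range(NUM_NODES)]


# Hypothetical metadata keywords; most of them start with "t".
keywords = ["temperature", "temp_min", "temp_max", "time",
            "timestamp", "tile_id", "redshift", "mass"]

full_load = load_per_node(keywords, full_string_node)
initial_load = load_per_node(keywords, initial_node)
```

Under initial hashing, a prefix query such as `"temp"` is routed to one node, but the six keywords beginning with `t` all land on that same node, so both storage and query load are skewed. Under full string hashing, placement ignores shared prefixes, but the same prefix query has to contact all `NUM_NODES` nodes. These are the two costs that the abstract describes DART as avoiding.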

Embargo status: Restricted until 06/2022.



Scientific Data Discovery, Metadata Index, Metadata Search