Clustering and fast search of high-dimensional big data
Similarity search aims to retrieve the objects most similar to a given query, which is useful in many information retrieval applications. For big data, which combines different kinds of data from various sources, search efficiency is as important as search quality. The importance of fast search motivates this research. To enable fast search, data needs to be organized (indexed), and clustering is one way to do so. For clustering to be effective for search, we first need to identify an appropriate number of clusters, and second, develop a fast, easy-to-use clustering procedure that allows the software to be evaluated. My PhD research consists of two parts, each focusing on one significant aspect of clustering. The first part is a fast, easy-to-use clustering procedure with automatic determination of the number of clusters. We used clustering as a framework to organize high-dimensional data. Specifically, we implemented the hierarchical indexing structure known as the similarity search tree (SS-Tree) (White and Jain, 1996) together with the clustering algorithm. We also investigated the problem of determining the number of clusters in clustering algorithms. The second part concentrates on clustering as a means of facilitating search, since clustering is one of the most important methods for reducing the CPU and disk cost of similarity search. We proposed a new structure for more efficient search in high-dimensional big data and compared it with the existing tree-based structure in terms of computation time when finding the most similar data points within given ranges. The results indicate that a linear organization of the data set outperforms the tree-based structure for high-dimensional big data.
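To make the notion of a range query over clustered data concrete, the following is a minimal illustrative sketch and not the structure proposed in this research. It assumes Euclidean distance, cluster labels that are already given, and one bounding sphere per cluster in the spirit of the SS-Tree; all function names are hypothetical. It contrasts a plain linear scan with a scan that skips clusters whose bounding sphere cannot intersect the query range.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def range_search_linear(points, query, radius):
    """Indices of all points within `radius` of `query`, checking every point."""
    return [i for i, p in enumerate(points) if euclidean(p, query) <= radius]

def build_clusters(points, labels):
    """Group point indices by label; store (centroid, covering radius, members).

    The labels are assumed to come from some clustering step that is not
    shown here.
    """
    groups = {}
    for i, lab in enumerate(labels):
        groups.setdefault(lab, []).append(i)
    clusters = []
    for members in groups.values():
        dim = len(points[members[0]])
        centroid = [sum(points[i][d] for i in members) / len(members)
                    for d in range(dim)]
        cover = max(euclidean(points[i], centroid) for i in members)
        clusters.append((centroid, cover, members))
    return clusters

def range_search_clustered(points, clusters, query, radius):
    """Range search that prunes whole clusters using the triangle inequality."""
    hits = []
    for centroid, cover, members in clusters:
        # If the centroid is farther than radius + cover from the query,
        # no member of this cluster can lie within the query range.
        if euclidean(centroid, query) > radius + cover:
            continue
        hits.extend(i for i in members
                    if euclidean(points[i], query) <= radius)
    return sorted(hits)

# Toy data: two well-separated groups in two dimensions.
points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9)]
labels = [0, 0, 1, 1]
clusters = build_clusters(points, labels)
query = (0.0, 0.1)
```

Both functions return the same result set; the clustered variant only saves distance computations when whole clusters can be skipped, which tends to become harder as dimensionality grows and bounding regions overlap.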