The power of latent semantic indexing in review retrieval

Date

2016-11-21

Abstract

Over the past decade, the Internet has given users access to information in the form of web pages and digital documents covering a vast variety of topics. The rapid growth in the number of such documents has created a need for effective approaches to automatically analyzing their text in order to retrieve the most appropriate documents. Latent Semantic Indexing (LSI) is a well-known Information Retrieval technique that ranks documents by their relevance to a given query. Unlike keyword matching, which scores a document by exact matches between its words and those of the query, LSI uses words that occur together in documents to capture hidden semantic relationships between documents and the words in the query, and can thereby improve the ranking of relevant documents. It does so by performing a rank-reduced Singular Value Decomposition on a word-by-document matrix, which preserves important information across documents while eliminating noise, then transforming the query into this lower-dimensional space and measuring the similarity of the query to each document. Although LSI has been widely used, it assumes that the query basis words are present in the documents. This thesis examines the power of LSI and whether it can be improved. First, can we relax the assumption that query basis words must occur in the documents? In practice, this assumption does not necessarily hold, especially when documents are very short. Second, since LSI relies on associations among word occurrences in documents to improve query semantics, will LSI maintain its power in document retrieval when documents share few common words or are short, such as tweets or online reviews? The main contribution of this thesis is a study of the power of LSI that leads to LSI+ (LSI-plus), an intelligent LSI method that ranks relevant documents without requiring the assumption that query words appear in the documents.
Furthermore, even when the assumption holds, LSI+ performs better than LSI. The key element of LSI+ is its use of domain-specific knowledge about the review query. In particular, we propose a method that employs human expert knowledge to assign relevance scores to document words (as opposed to the frequencies of occurrence used in traditional LSI). The method is systematic and general, so it can be applied automatically to other domains. The thesis illustrates the power of LSI+ through experiments on reviews from the Yelp dataset, ranking reviews by their relevance to the query. LSI+ gives promising results, with area under the curve higher than that obtained from traditional LSI. The thesis concludes with plans for future work on LSI+.
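
The traditional LSI pipeline summarized in the abstract (rank-reduced SVD of a word-by-document matrix, projection of the query into the reduced space, similarity of the query to each document) can be illustrated with a minimal sketch. This is not the thesis's implementation and does not cover LSI+: the function name `lsi_rank`, the five-word vocabulary, the four toy reviews, the choice of k = 2, and the use of cosine similarity are all assumptions made for illustration.

```python
import numpy as np

def lsi_rank(term_doc, query_vec, k):
    """Rank documents against a query in a rank-k latent space (illustrative)."""
    # Thin SVD of the word-by-document matrix: term_doc = U @ diag(s) @ Vt
    U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
    U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

    # Each document is represented by one k-dimensional row of V_k.
    doc_vecs = Vt_k.T

    # Fold the query into the same space: q_k = S_k^{-1} U_k^T q
    q_k = (U_k.T @ query_vec) / s_k

    # Cosine similarity between the folded-in query and every document.
    norms = np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q_k)
    sims = (doc_vecs @ q_k) / np.where(norms == 0, 1.0, norms)

    # Document indices sorted from most to least similar.
    order = np.argsort(-sims)
    return order, sims

# Hypothetical vocabulary and toy "reviews" (assumed data, not from the thesis).
vocab = ["pizza", "pasta", "oven", "battery", "charger"]
docs = ["pizza oven", "pasta oven", "battery charger", "charger"]

# Word-by-document matrix of raw term frequencies (rows: words, columns: docs).
term_doc = np.array(
    [[doc.split().count(word) for doc in docs] for word in vocab],
    dtype=float,
)

# Query containing the single word "pizza".
query_vec = np.array([1.0 if word == "pizza" else 0.0 for word in vocab])

order, sims = lsi_rank(term_doc, query_vec, k=2)
for idx in order:
    print(f"{docs[idx]!r}: {sims[idx]:.3f}")
```

In this toy example the review "pasta oven" shares no word with the query "pizza", so exact keyword matching would give it no score, yet it ranks alongside "pizza oven" because both reviews co-occur with "oven". The two electronics reviews score approximately zero.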

Keywords

Latent semantic indexing, Expert knowledge, Document retrieval, Review retrieval, Singular value decomposition, Semantic score, WordNet, Ontology
