Measuring Semantic Similarity between Words Using Web Search Engines
Authors:
Danushka Bollegala (The University of Tokyo)
Yutaka Matsuo (AIST)
Mitsuru Ishizuka (The University of Tokyo)
Abstract:
Semantic similarity measures play important roles in information retrieval and natural language processing. In information retrieval, they are used in automatic query suggestion and expansion. Previous work on Semantic Web-related applications such as community mining, relation extraction, and automatic metadata extraction has used various semantic similarity measures. Despite the usefulness of semantic similarity measures in these applications, robustly measuring semantic similarity between two words (or entities) remains a challenging task. Semantic similarity is a dynamic phenomenon that changes over time and across domains. In this paper, we propose a robust semantic similarity measure that uses the information available on the Web to measure similarity between words or entities. The proposed method exploits page counts and text snippets returned by a Web search engine. We define various similarity scores for two given words \textit{P} and \textit{Q}, using the page counts for the queries \textit{P}, \textit{Q}, and \textit{P AND Q}. Moreover, we propose a novel approach to compute semantic similarity using lexico-syntactic patterns automatically extracted from text snippets. These different similarity scores are integrated using support vector machines to produce a robust semantic similarity measure. Experimental results on the Miller-Charles benchmark dataset show that the proposed measure outperforms all existing Web-based semantic similarity measures by a wide margin, achieving a correlation coefficient of $0.834$. Moreover, the proposed semantic similarity measure significantly improves accuracy ($F$-measure of $0.78$) in a community mining task and improves accuracy in an entity disambiguation task, thereby verifying its capability to capture semantic similarity using Web content.
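As a rough illustration of what a page-count-based score for \textit{P} and \textit{Q} could look like, the sketch below computes a Jaccard-style overlap from three page counts. The abstract does not name the specific scores, so the choice of a Jaccard-style formula, the function name `web_jaccard`, and the example counts are all assumptions made for illustration, not the authors' definitions.

```python
def web_jaccard(count_p: int, count_q: int, count_p_and_q: int) -> float:
    """Jaccard-style similarity from search-engine page counts (illustrative).

    count_p        -- page count for the query P
    count_q        -- page count for the query Q
    count_p_and_q  -- page count for the query "P AND Q"
    """
    # No co-occurrence at all: treat the words as unrelated.
    if count_p_and_q <= 0:
        return 0.0
    # Pages mentioning P or Q (inclusion-exclusion on the counts).
    union = count_p + count_q - count_p_and_q
    # Page counts are estimates and can be inconsistent; guard the division.
    if union <= 0:
        return 0.0
    return count_p_and_q / union


# Hypothetical page counts, hard-coded here instead of querying a search engine.
score = web_jaccard(count_p=100, count_q=50, count_p_and_q=25)
print(score)
```

A score like this, together with other page-count-based scores and pattern-based features, would then serve as one input feature to the support vector machine mentioned in the abstract.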
Slot:
New Brunswick, Friday, May 11, 2007, 1:30pm to 3:00pm.