Refereed Papers

Track: Data Mining

Paper Title:
Detecting Near-Duplicates for Web Crawling

Authors:

Gurmeet Manku (Google Inc.)
Arvind Jain (Google Inc.)
Anish Das Sarma (Stanford University)

Abstract:
Near-duplicate documents are commonly found on the web. A pair of near-duplicate web pages differ from each other in a very small portion. The differences commonly consist of advertisements and timestamps. Such differences are irrelevant for web search. During web crawling, it is useful to quickly ascertain whether a newly crawled web page is a near-duplicate of a previously crawled web page or not.

In the course of developing a practical system for near-duplicate detection, we make two research contributions. First, we demonstrate the effectiveness of Charikar's fingerprinting technique for identifying near-duplicate web pages. We show that for 8 billion web-pages, a good choice of parameters is 64-bit fingerprints and 3-bit Hamming distances. Second, we present an algorithmic technique for identifying existing f-bit fingerprints that differ from a given fingerprint in at most k bit-positions, for small k. Our technique is useful for both online queries (single fingerprints) and batch queries (multiple fingerprints). Experimental evaluation over real data confirms the practicality of our design.

PDF version