![Top of Menu](images/menuTop.jpg)
![Home](images/menuHome.jpg)
![CFP](images/menuCfp.jpg)
![Program](images/menuProgramS.jpg)
![Committees](images/menuCommittee.jpg)
![Key Dates](images/menuKeyDates.jpg)
![Location](images/menuLocation.jpg)
![Hotel](images/menuHotel.jpg)
![Registration](images/menuRegistration.jpg)
![Students](images/menuStudents.jpg)
![Sponsors](images/menuSponsors.jpg)
![Media](images/menuMedia.jpg)
![Submission](images/menuSubmission.jpg)
![Tutorials](images/menuTutorial.jpg)
![Workshops](images/menuWorkshops.jpg)
![Travel Info](images/menuTravel.jpg)
![Proceedings](images/menuProceedings.jpg)
Track: Data Mining
Paper Title:
Detecting Near-Duplicates for Web Crawling
Authors:
Abstract:
Near-duplicate documents are commonly found on the web. A pair of
near-duplicate web pages differ from each other in a very small
portion. The differences commonly consist of advertisements and
timestamps. Such differences are irrelevant for web search. During
web crawling, it is useful to quickly ascertain whether a newly
crawled web page is a near-duplicate of a previously crawled web page or not.
In the course of developing a practical system for near-duplicate detection, we make two research contributions. First, we demonstrate the effectiveness of Charikar's fingerprinting technique for identifying near-duplicate web pages. We show that for 8 billion web-pages, a good choice of parameters is 64-bit fingerprints and 3-bit Hamming distances. Second, we present an algorithmic technique for identifying existing f-bit fingerprints that differ from a given fingerprint in at most k bit-positions, for small k. Our technique is useful for both online queries (single fingerprints) and batch queries (multiple fingerprints). Experimental evaluation over real data confirms the practicality of our design.