Track: Search
Paper Title:
Designing Efficient Sampling Techniques to Detect Webpage Updates
Authors:
Abstract:
Due to resource constraints, Web archiving systems and search
engines usually have difficulty keeping their entire local
repository synchronized with the Web. We advance the state of the art
in sampling-based synchronization techniques by answering a
challenging question: Given a sampled webpage and its
change status, which other webpages and how many of them are also
likely to change? We present a study of various downloading
granularities and policies, and propose an adaptive model based on
the update history and the popularity of the webpages. We run
extensive experiments on a large dataset of approximately 300,000
webpages and show that additional updated webpages are most likely
to be found in the same directory as a changed sample or in its upper
directories. Moreover, the adaptive strategies outperform
the non-adaptive one at detecting important changes.
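
To make the directory-based finding concrete, the following is a minimal sketch, not the method evaluated in the paper. It assumes that the local repository is a flat list of URLs and that "directory" means the URL path up to the last slash. The function names (`directory_of`, `ancestors`, `candidates_after_change`) and the `levels_up` parameter are hypothetical, introduced only for illustration. Given a sample that was found to have changed, the sketch lists the other repository pages in the same directory and in a bounded number of upper directories as candidates for re-download.

```python
from collections import defaultdict
from urllib.parse import urlsplit
import posixpath


def directory_of(url):
    """Return (host, directory path) for a URL, e.g. ('a.com', '/x/y')."""
    parts = urlsplit(url)
    path = parts.path or "/"
    # A trailing slash names the directory itself; otherwise drop the file name.
    directory = path.rstrip("/") if path.endswith("/") else posixpath.dirname(path)
    return parts.netloc, directory or "/"


def ancestors(directory, levels):
    """Yield the directory, then up to `levels` of its parent directories."""
    current = directory
    yield current
    for _ in range(levels):
        if current == "/":
            break
        current = posixpath.dirname(current)
        yield current


def candidates_after_change(sample_url, repository_urls, levels_up=1):
    """List repository pages near a changed sample (assumed granularity).

    Pages in the sample's own directory come first, followed by pages in
    each upper directory, up to `levels_up` levels on the same host.
    """
    index = defaultdict(list)
    for url in repository_urls:
        index[directory_of(url)].append(url)

    host, directory = directory_of(sample_url)
    result = []
    for d in ancestors(directory, levels_up):
        for url in index.get((host, d), []):
            if url != sample_url:
                result.append(url)
    return result


if __name__ == "__main__":
    repo = [
        "http://a.com/x/y/p1.html",
        "http://a.com/x/y/p2.html",
        "http://a.com/x/q.html",
        "http://a.com/r.html",
        "http://b.com/x/y/z.html",
    ]
    print(candidates_after_change("http://a.com/x/y/p1.html", repo, levels_up=1))
```

An adaptive policy of the kind described above could reorder or truncate this candidate list using each page's update history and popularity; the sketch leaves that weighting out because the abstract does not specify it.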