Track: Search
Paper Title:
Combining Classifiers to Identify Online Databases
Authors:
Abstract:
We address the problem of identifying the domain of online
databases. More precisely, given a set F of Web forms automatically
gathered by a focused crawler and an online
database domain D, our goal is to select from F only the
forms that are entry points to databases in D. Having a
set of Web forms that serve as entry points to similar online
databases is a requirement for many applications and techniques
that aim to extract and integrate hidden-Web information,
including meta-searchers, database selection tools,
hidden-Web crawlers, form-schema matching and merging,
and the construction of online database directories.
We propose a new strategy that automatically and accurately
classifies online databases based on features that
can be easily extracted from Web forms. By judiciously
partitioning the space of form features, this strategy allows
the use of simpler classifiers that can be constructed using
learning techniques that are better suited for each partition.
Experiments using real Web data in a representative
set of domains show that the use of different classifiers leads
to high accuracy, precision and recall. This indicates that
our modular classifier composition provides an effective and
scalable solution for classifying online databases.
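
The idea of partitioning form features and composing simpler classifiers can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the abstract: the split into structural features and text features, the hand-written rule standing in for a learned structural classifier, the naive Bayes text classifier, the sequential composition, and all names (Form, is_searchable, TextDomainClassifier, select_forms).

```python
import math
from collections import Counter
from dataclasses import dataclass
from typing import List


@dataclass
class Form:
    # Partition 1 (assumed): structural features of the form.
    n_text_inputs: int
    n_selects: int
    n_password: int
    # Partition 2 (assumed): words appearing in the form.
    words: List[str]


def is_searchable(form: Form) -> bool:
    """Hypothetical rule over structural features only.

    Stands in for a classifier learned on this partition: forms with a
    password field are treated as login forms, not database entry points.
    """
    return form.n_password == 0 and (form.n_text_inputs + form.n_selects) >= 1


class TextDomainClassifier:
    """Multinomial naive Bayes with add-one smoothing over form words."""

    def fit(self, docs: List[List[str]], labels: List[str]) -> "TextDomainClassifier":
        self.class_counts = Counter(labels)
        self.word_counts = {c: Counter() for c in self.class_counts}
        for words, label in zip(docs, labels):
            self.word_counts[label].update(words)
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, words: List[str]) -> str:
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)
            for w in words:
                if w in self.vocab:  # ignore words never seen in training
                    score += math.log((counts[w] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def select_forms(forms: List[Form], text_clf: TextDomainClassifier, domain: str) -> List[Form]:
    """Compose the two classifiers: keep forms that pass the structural
    check and whose text is assigned to the target domain D."""
    return [f for f in forms if is_searchable(f) and text_clf.predict(f.words) == domain]
```

In this sketch each classifier sees only its own partition of the features, so each can be kept simple and trained or replaced independently of the other.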