WWW 2007, May 8-12, 2007,
Jianjun Yu Shengmin Guo Hao Su Hui Zhang Ke Xu
State Key Lab. of Software Development Environment
This paper describes a kernel based Web Services (abbreviated as service) matching mechanism for service discovery and integration. The matching mechanism tries to exploit the latent semantics by the structure of services. Using textual similarity and n-spectrum kernel values as features of low-level and mid-level, we build up a model to estimate the functional similarity between services, whose parameters are learned by a Ranking-SVM. The experiment results showed that several metrics for the retrieval of services have been improved by our approach.
H.3.3 [Information Storage and Retrieval Information]: Search and Retrieval - retrieval models, search process, selection process
H.3.5 [Information Storage and Retrieval]: Online Information Services - Web-based Services
Algorithms, Languages, Design, Experimentation
Web Services, Web Services Matching, WSDL, n-Spectrum Kernel, Ranking SVM
A hotspot in the research of service is to realize large-scale service discovery and integration in Internet. Conventional matching mechanisms, like in UDDI, are mainly based on keyword search, however, the precision of those approaches are relatively low. To address this problem, several approaches considering structural information have sprung up [1,3,6]. They significantly improve the precision with the requirement that matched services should be similar in structure. But in practical, those approaches are too strict to recognize similar services with different data structure encapsulation. Consequently, they have low recalls.
Therefore, we put forward a novel loose tree matching algorithm which extracts the structural features from another perspective to help improve the recognition of functionally similar services.
In our algorithm, services are schemed by WSDL (Web Services Description Language) as tree-structured documents, and two kinds of features are extracted.
One kind is to calculate two WSDL documents' text similarity with a classical VSM (Vector Space Model) , regarding the documents as unstructured text. In the classical VSM, a document is formalized as a vector, and each dimension of the vector represents a word in the document. The value of the dimension is calculated by the formula  as an estimation of the importance of the word in the document.
The other kind of features describes the structural similarity. It takes two steps to get such features. First, we have to do some preprocess to extract document trees from the two WSDL documents and align their nodes according to the label's textual similarity. Aligned nodes are considered identical. After the preprocess, we model the documents trees as a vector in a n-spectrum vector space , and use the newly brought forward n-spectrum kernel function  to compare how much hierarchical relationships the two trees share in common. With a set of different , we have a set of different function values as features.
Modeling a document tree as a vector in the n-spectrum vector space is similar to modeling a text document stated above, except that each dimension in the n-spectrum vector space represents an n-path subsequence defined as below:
A path subsequence (abbreviated as ) is an ordered set of nodes extracted from a path of the document tree (abbreviated as ). The set containing all ( is the number of nodes in a ) of the is called the n-spectrum of , denoted by . Intuitively, two document trees are more similar if they share more common .
Then, we consider the problem of a weight in the vector. A may not be picked contiguously from a path, however, we assume that contiguous picked ps should weigh more. So, we define a decay factor to weight the presence of gap in a . For the indices identifying the occurrence of a in a document tree , we use to denote the distance between the first node and the last node in the corresponding path, namely, the difference of their depth in the tree. In the gap-weighted kernel, we weight the occurrence of the with the exponentially decaying weight .
With the set of and their weights in the vector, we can give the formula of n-spectrum kernel value:
DEFINITION 1 (GAP-WEIGHTED N-SPECTRUM). The feature space F associated with the gap-weighted spectrum kernel of n-ps is indexed by (the set of all n-), with the embedding given by
Where p is the indices identifying the occurrence of . The associated kernel is defined as:
where and .
With the textual and structural similarities, we can
estimate the functional similarity as follows:
where , and are two services, and are their corresponding document trees, is their textual similarity, and is the gapped i-spectrum kernel describing their structural similarity.
The parameters are estimated by the Ranking SVM . Ranking SVM is brought forward by machine learning researchers in recent years to solve ordinal regression problem, and it has been proved to be effective . In this paper, the retrieval of matched services is formalized as a ''learning to rank'' problem, where the matching degree between query and returned services are mapped into three ordered categories (ranks). We have three features in the ranking problem: text similarity, 2-spectrum and 3-spectrum kernel values, and their weights are in Equation (3). Our experiment result shows that the best values of are and the best values of are in our data sets.
The experiment data set of services is collected from WWW, containing 2,140 valid services in total.
In this paper, we define three categories for the judgment of the matching degree between the query and the WSDL documents: 'good, 'insignificant' and 'bad'.
Generally, we hope that a good return can be modified with little effort to replace the query in Web environment. Therefore, it should provide almost the same function to the query. A bad return may only be textually similar to the query document with no obvious structural similarity. An insignificant return is the one with a similarity degree between good and bad.
As a baseline method, we simply use the text similarity to rank and categorize the candidates into three categories. This method does not take structural information into consideration at all. As another baseline method, we use the Edit Distance from Gabriel Valientes book  to get the similarity between WSDL documents.
We made use of four evaluation metrics, which are ''error rate'', ''R-precision'', ''Top-N precision'', and ''Average precision''. The evaluation result on the test set is listed in Figure 1.
We have our evaluation on two levels. The first group is evaluated using 'good', while the second group is evaluated by replacing 'good' with 'good' and 'insignificant'. Taking a closer look at Figure 1, we notice that Ranking-SVM outperforms other approaches in every metric. As an interpretation of the result, we may regard VSM as low-level feature and 2-spectrum and 3-spectrum as mid-level features with structural information. The low-level feature is directly extracted from WSDL with less deeper semantic and less noise, while mid-level features reflect deeper semantic with more noises. Ranking-SVM combines VSM, 2-spectrum kernel and 3-spectrum features together, so that low-level feature and high-level features complement each other; as a result, it outperforms all other methods. We may also notice that the R-precision and Top-5 precision for 'good' are relatively equal. In fact, the average number of relevant documents to the queries is approximately 5.
In this paper, we present a novel approach for services matching problem. In order to achieve the task, we model WSDL documents in the vector space and then defined the gapped n-spectrum kernel function in the space. Using textual similarity and n-spectrum kernel values as features of low-level and mid-level, we build up a model to estimate the functional similarity between services, whose parameters are learned by a Ranking-SVM. Experimental results indicate that our model significantly outperforms other methods. Since the n-spectrum kernel function is defined in vector space, the framework of our approach can be easily adapted to matching problems of other domains.
Searching XML Documents via XML Fragments.
In the Proceedings SIGIR'03, 2003.
Cluster Analysis. 2nd edition.
A Comparison of Web Service Interface Similarity Measures.
J. Shawe-Taylor and N. Cristianini.
Kernel Methods for Pattern Analysis.
Algorithms on Trees and Graphs.
S. Eleni, O. M. E., W. Sanjiva, P. M. P., and J. Yang.
Semantic Structure Matching for Assessing Web Service Similarity.
In the Proceeding of ICSOC'03, 2003.
J. Xu, Y. Cao, H. Li, and M. Zhao.
Ranking Definitions with Supervised Learning Methods.
In the Proceedings of WWW, 2005.