Estimating the Cardinality of RDF Graph Patterns
Department of Computer Science, University of Georgia
{maduko, anyanwu}@cs.uga.edu
Department of Statistics, University of Georgia
pdschlie@stat.uga.edu
Department of Computer Science and Engineering, Wright State University
amit.sheth@wright.edu
ABSTRACT
Most RDF query languages allow for graph structure search through a conjunction of triples, which is typically processed using join operations. A key factor in optimizing joins is determining the join order, which depends on the expected cardinality of intermediate results. This work proposes a pattern-based summarization framework for estimating the cardinality of RDF graph patterns. We present experiments on real-world and synthetic datasets which confirm the feasibility of our approach.
Categories and Subject Descriptors
H.2.4 [Database Management]: Systems – Query processing
General Terms
Management, Performance.
Keywords
RDF Query Processing, RDF Semantic Summary, RDF Structural Summary, Statistical Summary, Pattern Cardinality Estimation.
1. INTRODUCTION
Most RDF query languages allow for graph structure search through a conjunction of triples [3], often processed using join operations. A key step in join query optimization is determining the join order, which depends on the expected size of intermediate results. For example, to detect a basic potential conflict of interest in the paper review process of a conference, one may search for PC members who are authors of submitted papers. Figures 1a and 1b show the SPARQL expression for this query (namespaces are omitted) and its graph structure (pattern). The query requires three join operations. One order first finds all reviewers who are authors of submitted papers, then joins the result with the set of papers submitted to “WWW”. Figures 1d, 1e and 1f show the possible patterns with which processing may begin. Choosing the optimal join order requires good estimates of the cardinalities of these patterns, obtained from a summary of the frequency distribution of patterns. A summary that is seamless across RDF systems is also desirable. This work is a first step towards addressing this problem. We focus on structure queries with no variables on the edges (as in Figure 1a), which are useful in real applications.

We address the following problem: given a space budget B, an RDF schema graph GS and an instance graph GI, summarize the frequency distribution of patterns in GI within the budget B so that good estimates of pattern frequencies can be obtained. We propose and evaluate three techniques which differ in the value/size tradeoff made during summarization.
2. PATTERN-BASED SUMMARIZATION
In our graph model for RDF, nodes denote the subjects and objects of schema (instance) triples, while edges denote the predicates relating the respective subject and object nodes.
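To make the graph model concrete, the following minimal Python sketch (ours, not the paper's code; the predicate names are hypothetical stand-ins for the labels in Figure 1, which is not reproduced here) represents a query pattern as a set of labeled edges and enumerates its sub-patterns, such as those of Figure 1d-f whose cardinalities guide join ordering:

# A pattern is a set of labeled edges (subject, predicate, object); nodes that
# start with "?" are variables. Predicate names are hypothetical stand-ins.
pattern = {
    ("?member", "memberOfPC", "?conf"),
    ("?member", "authorOf", "?paper"),
    ("?paper", "submittedTo", "?conf"),
}

def sub_patterns(pattern):
    """Yield every non-empty proper sub-pattern (subset of the edge set)."""
    edges = sorted(pattern)
    for mask in range(1, 2 ** len(edges) - 1):
        yield {e for i, e in enumerate(edges) if mask >> i & 1}

for sp in sub_patterns(pattern):
    print(sp)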
Figure 1: A graph pattern query and its sub-graph patterns.
2.1 Pattern Sequentialization
To summarize the frequency distribution of patterns, we need a generic representation for all embeddings of a query pattern. To this end, we summarize the schema graph into a structure that contains all patterns that exist, or may exist, in the RDF instance graph. This summary, which we call the RDF Semantic and Structural Summary (GSSS), extends the RDF Semantic Summary [1]. We adopt the Minimum DFS Code (MDC) [4] of a pattern as its canonical label. Simply put, if all edges of a graph are assigned labels from a finite set of labels with a total order, the MDC of the graph is the lexicographically smallest sequence of edge labels obtained by performing a DFS traversal of the graph (see [4] for details). We assign a 5-tuple <i, j, g(l), h(u), h(v)> to each edge e = (u, v) labeled l, where i and j are the DFS discovery times of u and v, h maps u (resp. v) to the number of its GSSS node, and g maps l to the number of the GSSS edge with the same label.
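As an illustration only, the brute-force Python sketch below enumerates the DFS codes of a small connected, edge-labeled graph and returns the lexicographically smallest one. The gSpan procedure of [4] prunes this search rather than enumerating exhaustively, and the mapping of nodes and labels into 5-tuples through g and h is omitted here:

def dfs_codes(adj):
    """Enumerate the edge sequences of all DFS traversals of a connected,
    undirected, edge-labeled graph. Each edge is reported as (i, j, label),
    where i and j are the DFS discovery times of its endpoints."""
    n_edges = sum(len(nbrs) for nbrs in adj.values()) // 2
    results = []

    def step(stack, disc, used, code):
        if len(used) == n_edges:            # every edge traversed: one full code
            results.append(tuple(code))
            return
        node = stack[-1]
        moved = False
        for nbr, label in adj[node]:        # branch over the next edge to take
            edge = (frozenset({node, nbr}), label)
            if edge in used:
                continue
            moved = True
            if nbr in disc:                 # backward edge to a discovered node
                step(stack, disc, used | {edge},
                     code + [(disc[node], disc[nbr], label)])
            else:                           # forward edge: discover nbr, descend
                step(stack + [nbr], {**disc, nbr: len(disc)}, used | {edge},
                     code + [(disc[node], len(disc), label)])
        if not moved and len(stack) > 1:    # dead end: backtrack
            step(stack[:-1], disc, used, code)

    for start in adj:                       # try every start node
        step([start], {start: 0}, frozenset(), [])
    return results

def minimum_dfs_code(adj):
    """The lexicographically smallest DFS code serves as the canonical label (MDC)."""
    return min(dfs_codes(adj))

# Example: a triangle with edge labels p, q, r.
adj = {
    "a": [("b", "p"), ("c", "q")],
    "b": [("a", "p"), ("c", "r")],
    "c": [("a", "q"), ("b", "r")],
}
print(minimum_dfs_code(adj))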
2.2 Pattern Cardinality Estimation
Given a query pattern GQ, we first validate GQ against GSSS to check that it can occur in the instance graph. If so, we obtain its sequentialized representation GQS. Next, we check whether the summary contains GQS; if it does, we return its stored cardinality. If not, we assume that the minimal size super-patterns of a pattern are uniformly distributed (the size of a pattern is the number of edges it contains) and return the product of the cardinality of its maximal size sub-pattern GQSM and the pattern growth rate of GQSM. If GQSM has cardinality C and D unique minimal size super-patterns (that are not in the summary) with total cardinality N, the growth rate of GQSM is N/(C·D); it is used to compensate for the frequency anti-monotone property, which may not always hold. If GQS has multiple maximal sub-patterns in the summary, we greedily select the one with the least growth rate.
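A compact sketch of this estimation step, assuming the summary is kept as a dictionary from canonical labels to cardinalities, growth rates are precomputed as above, and a helper sub_patterns_by_size (ours, hypothetical) yields the canonical labels of sub-patterns grouped by decreasing size:

def estimate_cardinality(query_code, summary, growth_rate, sub_patterns_by_size):
    """Return the stored cardinality of a validated query pattern, or estimate it
    from a maximal stored sub-pattern scaled by that sub-pattern's growth rate."""
    if query_code in summary:                       # summary contains the pattern
        return summary[query_code]
    # Walk sub-patterns from largest to smallest; stop at the first (maximal)
    # size for which at least one sub-pattern is stored in the summary.
    for size_group in sub_patterns_by_size(query_code):
        hits = [s for s in size_group if s in summary]
        if hits:
            # Several maximal sub-patterns: greedily take the least growth rate.
            best = min(hits, key=lambda s: growth_rate.get(s, 1.0))
            return summary[best] * growth_rate.get(best, 1.0)
    return 0    # no stored sub-pattern: treat the pattern as absent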
2.3 Summary Construction
Our technique exploits 1) the fact that patterns may have nearly the same frequencies as their sub-patterns and 2) prior knowledge of the importance of patterns. The concept of a value for patterns therefore forms the basis of our summary construction. The value of a pattern derives from its observed and estimation values. The observed value (tailored to different applications) captures known information about pattern importance. Given a set of patterns P = (P1, P2, …, Pm) with frequencies (PF1, PF2, …, PFm), if POI = (POI1, POI2, …, POIm) is a vector containing prior information about the patterns such that 0 ≤ POIi ≤ 1, we define the observed value of pattern Pi as POVi = POIi × maxj{PFj}. The estimation value of a pattern is the number of its minimal size super-patterns whose cardinalities are estimable from its cardinality within an error of ε. We derive ε using an exponential-increase/logarithmic-decrease approach that refines an initially small ε until we obtain a set PS such that the total size of the patterns in PS meets the budget and their total estimation value is at least the total size of the patterns not in PS. Based on the observed and estimation values, we value patterns as follows. Let P = (P1, P2, …, Pm) be a set of patterns and let PEVmax and PFmax be, respectively, the maximum of the estimation values and of the frequencies of the patterns in P. Given a constant c ∈ [0, 1] and an indicator variable Ii that takes the value 1 if POVi ≥ c × PFmax and 0 otherwise, the value of any pattern Pi is given by:
PVi = (1 + PEVi)(1 + POVi) + Ii(c × PFmax × PEVmax)
The additive constants in the first term of the equation ensure that the value of a pattern is non-zero when either its observed or estimation value is non-zero. The constant c in the second term allows the summary to be tuned to suit application needs by favoring some patterns.
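Read literally, the value assignment can be computed as in the following sketch (variable names are ours; the frequencies PF, estimation values PEV and prior weights POI are assumed given):

def pattern_values(freqs, est_values, prior, c):
    """PV_i = (1 + PEV_i)(1 + POV_i) + I_i * (c * PF_max * PEV_max), with
    POV_i = POI_i * max_j PF_j and I_i = 1 iff POV_i >= c * PF_max."""
    pf_max = max(freqs)
    pev_max = max(est_values)
    values = []
    for pev, poi in zip(est_values, prior):
        pov = poi * pf_max                          # observed value POV_i
        indicator = 1 if pov >= c * pf_max else 0   # I_i
        values.append((1 + pev) * (1 + pov) + indicator * c * pf_max * pev_max)
    return values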
Our construction algorithm first generates all patterns of up to a fixed size d and then prunes less valuable patterns until the budget is met. Given a budget B and vectors P = (P1, P2, …, Pm), PS = (PS1, PS2, …, PSm) and PV = (PV1, PV2, …, PVm) holding the patterns of size at most d, their sizes (in bytes) and their values, we consider three alternatives for selecting patterns to meet the budget.
Optimal Solution (OPT). The optimal solution maximizes Σi wiPVi subject to Σi wiPSi ≤ B, where wi is an indicator variable whose value is 1 if pattern Pi is part of the solution and 0 otherwise. This is a 0/1 knapsack problem, which we solve with dynamic programming by observing that V[k, b] = max(V[k−1, b], V[k−1, b − PSk] + PVk), where V[k, b] is the maximal total value of a selection from the first k patterns with total size at most b; V[m, B] is then the maximal value of patterns with total size at most B.
Since 1) this solution may select lower valued patterns over higher valued ones and 2) the underlying problem is NP-hard (the dynamic program is only pseudo-polynomial in the budget B), we consider two greedy approximations.
Maximal Value Greedy Solution (MVG). Patterns are selected in decreasing order of value, to favour less error-tolerant patterns.
Maximal Unit Value Greedy Solution (MUVG). Patterns are selected in decreasing order of value per unit size.
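The three alternatives amount to a 0/1 knapsack solved by dynamic programming plus two standard greedy heuristics. A minimal sketch, assuming integer byte sizes and budget:

def select_opt(values, sizes, budget):
    """OPT: 0/1 knapsack over the byte budget, via the recurrence
    V[k][b] = max(V[k-1][b], V[k-1][b - PS_k] + PV_k)."""
    m = len(values)
    V = [[0] * (budget + 1) for _ in range(m + 1)]
    for k in range(1, m + 1):
        for b in range(budget + 1):
            V[k][b] = V[k - 1][b]
            if sizes[k - 1] <= b:
                V[k][b] = max(V[k][b], V[k - 1][b - sizes[k - 1]] + values[k - 1])
    chosen, b = [], budget                  # walk the table back to recover the set
    for k in range(m, 0, -1):
        if V[k][b] != V[k - 1][b]:
            chosen.append(k - 1)
            b -= sizes[k - 1]
    return chosen

def select_greedy(values, sizes, budget, per_unit=False):
    """MVG (per_unit=False): decreasing value.  MUVG (per_unit=True): decreasing
    value per unit size.  Both skip patterns that no longer fit the budget."""
    key = (lambda i: values[i] / sizes[i]) if per_unit else (lambda i: values[i])
    chosen, used = [], 0
    for i in sorted(range(len(values)), key=key, reverse=True):
        if used + sizes[i] <= budget:
            chosen.append(i)
            used += sizes[i]
    return chosen

MUVG is the usual value-density heuristic for knapsack, while MVG simply prefers the most valuable (least error-tolerant) patterns irrespective of their size.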
3. EXPERIMENTS
We experimented on a real-life dataset (SwetoDBLP [5]) and a synthetic one (LUBM2, generated with the Lehigh University Benchmark [2]), which differ in the number and complexity of patterns. Excluding literal triples, the datasets have the following properties:

Property                     SwetoDBLP   LUBM2
#nodes                       95566       38370
#edges                       120512      113463
#edge labels                 31          12
average degree               ≈3          ≈6
#connected components (cc)   135         1
min #edges in a cc           4           113463
max #edges in a cc           106479      113463
size of patterns (bytes)     227149      23878

We compared OPT, MVG and MUVG using summary sizes of 5KB, 25KB and 50KB for SwetoDBLP and scaled-down sizes of 800B, 4KB and 8KB for LUBM2 (preserving the ratio of the number of unique patterns to the summary size). We also compared them with the histogram-based solution (HS) of an RDF database system, but only for the smallest summaries, since we could not control the summary size of HS. We used three query workloads: 1) a positive workload of patterns with cardinality > 0; 2) a frequent workload of patterns with cardinality ≥ 5000; and 3) a negative workload of patterns with cardinality 0. Using the absolute error metric |Ci − Ci′| (Ci and Ci′ are the true and estimated cardinalities), we cumulate the percentage of patterns estimated with an error of at most ε, for ε ∈ [0, E] where E is the maximum estimation error. We obtain pattern cardinalities for HS using the Explain command, after instructing the system to create detailed statistics for the datasets.
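For clarity, the cumulated percentages plotted in Figure 2 can be computed as in this small sketch (ours):

def cumulative_error_profile(true_cards, est_cards, thresholds):
    """Percentage of query patterns whose absolute error |C_i - C_i'| is at most
    each threshold in `thresholds`."""
    errors = [abs(c - e) for c, e in zip(true_cards, est_cards)]
    return [100.0 * sum(err <= t for err in errors) / len(errors)
            for t in thresholds]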
Figure 2: Performance on (a) SwetoDBLP and (b) LUBM2.
Figure 2a shows the results for the 5KB summary of SwetoDBLP on the positive workload. HS performed worst, with an estimation error of at least 1024 for all patterns, compared to about 20% of patterns for OPT and MVG and about 35% for MUVG. OPT, closely followed by MVG, achieved 100% accuracy and an error of at most 32 for 40% and 60% of the patterns, respectively. OPT performed best overall.

Figure 2b shows the results for LUBM2. Due to the irregularity of LUBM2, the techniques incur larger estimation errors, but given the summary sizes used their performance is promising. For example, OPT performed best with 100% accuracy on 60% of all patterns and an estimation error of at most 1024 for 10% of patterns on the 8KB summary. We also obtained promising results for the frequent workload on both datasets.

For the negative workload, HS performed best on SwetoDBLP, with 100% accuracy for about 80% of patterns, and worst on LUBM2, producing non-zero estimates for all patterns. OPT, MVG and MUVG produced non-zero estimates for about 90% of patterns on both datasets. These false positives indicate that spurious patterns exist in GSSS. We note that our estimation time is negligible, 0.00031 seconds on average.
4. FUTURE WORK
For future work, we will investigate techniques for estimating the cardinalities of larger patterns by combining those of a number of smaller ones. We also hope to improve the accuracy of negative queries and to investigate how to handle updates.
5. ACKNOWLEDGMENTS
This work is funded by NSF-ITR-IDM Awards #0325464 and #0714441, titled ‘SemDIS: Discovering Complex Relationships in the Semantic Web’.
6. REFERENCES
[1] Anyanwu, K., Maduko, A., Sheth, A. SemRank: Ranking Complex Relationship Search Results on the Semantic Web. WWW 2005.
[4] Yan, X., Han, J. gSpan: Graph-Based Substructure Pattern Mining. ICDM 2002.
[5] SwetoDBLP. http://lsdis.cs.uga.edu/projects/semdis/swetodblp