Towards Multi-granularity Multi-facet E-Book Retrieval
Chong Huang1, Yonghong Tian2, 3, Zhi Zhou2, Tiejun Huang1, 2, 3
1Graduate University, Chinese Academy of Sciences, Beijing 100039, China
2 Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100080, China
3 Institute of Digital Media, Peking University, Beijing 100871, China
1, 2, 3 {chuang, yhtian, zzhou, tjhuang}@jdl.ac.cn
ABSTRACT
Generally speaking, digital libraries have multiple granularities of semantic units: book, chapter, page, paragraph and word. However, current eBook retrieval systems have two limitations: (1) the granularity of retrievable units is either too big or too small, and intermediate scales such as chapters and paragraphs are ignored; (2) retrieval results are not grouped by facets, which would facilitate users' browsing and exploration. To overcome these limitations, we propose a multi-granularity multi-facet eBook retrieval approach.
Categories and Subject Descriptors
H.3.3 [Information Search and Retrieval]: Information Search and Retrieval; H.2.8 [Database Management]: Database Applications – Data Mining
General Terms
Algorithms, Theory.
Keywords
Multi-granularity, multi-facet, e-book retrieval.
Typically, a digital book (eBook) has two kinds of structures. Hierarchy: it has multiple granularities of semantic units, namely chapter, page, paragraph and word. Hub: each eBook is a center surrounded by its facets of properties. However, current information retrieval (IR) systems have two limitations when applied to eBook retrieval. First, the granularity of retrievable units may be either too big or too small: a whole book, or all matched words in it. In either case, it is too tiresome for a user to scan through the whole book or to search among thousands of matched but off-topic locations. Second, due to the abundance of results, grouped navigation is needed.
To overcome these limitations, we propose a Multi-granularity Multi-facet E-Book Retrieval (MMER) approach. The key to our solution is to extract facet-related information at every granularity, organize it in knowledge networks with hierarchical and radial structure, and finally provide more retrievable units and group results by facets (multi-facet navigation). Moreover, because scores at different granularities are interrelated, we define a multi-granularity similarity metric, which can be used for multi-granularity ranking in retrieval.
MMER relies on three key technologies: (1) accurate extraction algorithms for both full text and properties at any granularity; (2) an effective knowledge organization model to represent the relations involved, especially granularity-related and facet-related information; (3) novel usage of these two kinds of information in retrieval.
Multi-granularity Information Extraction. The first issue we encounter is that only book-level metadata is assigned by librarians. Thus, we developed modules to extract information at every granularity. First, we use a rule-based algorithm to separate a book into chapters or smaller granularities. Moreover, with the help of TOC files assigned by librarians, we extract the inter-granularity "belonging" relation. Second, we extract facet-related information, i.e., properties of a text, using machine learning approaches. For example, in our previous work [2], by treating a text as a semantic network, we extracted keyphrases through structural analysis of these networks based on the small-world model. The results are promising.
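As an illustration of the segmentation step, the sketch below splits a book's full text at lines that look like chapter headings; the heading pattern and function name are hypothetical assumptions, since the actual rules are not detailed here.

import re

# Hypothetical heading pattern; the actual rule set is not specified in the paper.
CHAPTER_HEADING = re.compile(r"^\s*(Chapter|CHAPTER)\s+\d+\b.*$", re.MULTILINE)

def split_into_chapters(book_text):
    """Split the full text of a book into chapter-sized units at heading lines."""
    starts = [m.start() for m in CHAPTER_HEADING.finditer(book_text)]
    if not starts:
        return [book_text]          # no headings found: keep the book as one unit
    boundaries = starts + [len(book_text)]
    return [book_text[boundaries[i]:boundaries[i + 1]]
            for i in range(len(starts))]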
Multi-granularity Information Organization. There are three kinds of relational knowledge organization models [1]: thesauri, knowledge networks, and ontologies. Thesauri are mostly restricted to lexical analysis, and ontologies are suitable for more formalized and proven knowledge with complex relationships. Thus, we organize the information in knowledge networks.
To represent hierarchical and hub-like information for eBooks, we propose a Multi-granularity Knowledge Network (MKN) model. MKN has two unique relations, namely scaling and belonging-to. The weights of these relations are manually assigned or learned from statistical models. Unlike a traditional knowledge network, MKN supports hierarchical browsing and facet-based navigation, more accurate book similarity analysis (with relevance ranking), and more.
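A minimal data-structure sketch of an MKN node follows; the field names are illustrative assumptions rather than the paper's actual schema.

from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class MKNNode:
    """One node of the Multi-granularity Knowledge Network: a semantic unit at
    some granularity, its facet-related properties, and its weighted
    belonging-to and scaling relations."""
    node_id: str
    granularity: str                                    # "book", "chapter", "paragraph", ...
    facets: Dict[str, List[str]] = field(default_factory=dict)        # e.g. {"subject": [...]}
    belongs_to: List[Tuple[str, float]] = field(default_factory=list)  # (parent_id, weight)
    scaling: List[Tuple[str, float]] = field(default_factory=list)     # (related_id, weight)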
Two similarity functions are defined to weight the relationships in MKN. A basic similarity function measures the multi-facet similarity of two nodes within one granularity; currently, we use a variation of cosine similarity in the vector space model (VSM) as the basic function. Second, since different granularities are interrelated, the basic scores of related nodes on upper or lower levels are aggregated with scaling weights to give the final multi-granularity score of two nodes. We also include indirect relationships with a decay factor on distance, by exploiting the transitivity of the similarity function. In particular, we construct two MKNs: BookNet and SubjectNet. Books are connected in BookNet by their multi-facet similarity scores; similarly, subjects are connected if they co-occur in the same book.
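The following sketch shows one way the two similarity functions could be realized, assuming term-frequency vectors for the basic cosine score and a simple weighted sum over related nodes on adjacent granularities; the scaling weight, decay value, and node representation are placeholders, not the paper's exact formulation.

import math
from collections import Counter

def basic_similarity(text_a, text_b):
    """Basic multi-facet similarity: a cosine over term-frequency vectors (a VSM variant)."""
    va, vb = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(va[t] * vb.get(t, 0) for t in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

def multi_granularity_score(node_a, node_b, scaling_weight=0.5, decay=0.8):
    """Final multi-granularity score of two nodes: the basic score plus the basic
    scores of node_a's related nodes on upper/lower granularities, weighted by a
    scaling weight and discounted by a decay factor on relation distance.
    A node is assumed here to be a dict with 'text' and 'related' = [(distance, text), ...]."""
    score = basic_similarity(node_a["text"], node_b["text"])
    for distance, related_text in node_a.get("related", []):
        score += scaling_weight * (decay ** distance) * basic_similarity(related_text, node_b["text"])
    return score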
Multi-granularity multi-facet IR. Based on the knowledge in MKN, the multi-granularity multi-facet IR approach returns results at both the book and chapter level. It includes three key points (as shown in Fig. 2):
Figure 2. The GUI of our eBook retrieval system.
(1) Facet grouping. Because of their common facet values (such as time, subject, etc.), eBooks are grouped by facets. Users can browse and re-search with facets on the facet tree and panel (a grouping sketch follows this list).
(2) Multi-granularity relevance analysis. Books and chapters are ranked in one list according to their multi-granularity similarity scores with the query. Users can access the relevant chapters directly.
(3) Information visualization (IV) module. The IV module is invoked when the user double-clicks a class name. Related subjects are visualized in a network style, as in SubjectNet. Introducing IV into query expansion helps users reformulate their queries.
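As a sketch of the facet-grouping step in point (1), results could be bucketed by the values of a chosen facet; the field names are illustrative assumptions.

from collections import defaultdict

def group_by_facet(results, facet_name):
    """Group retrieved books by the values of one facet (e.g. 'subject' or 'time').
    Each result is assumed to be a dict whose facet values are lists of strings."""
    groups = defaultdict(list)
    for book in results:
        for value in book.get(facet_name, ["(unspecified)"]):
            groups[value].append(book)
    return dict(groups)

# e.g. group_by_facet(hits, "subject") -> {"history": [...], "economy": [...], ...}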
In this experiment, our aim is to test the effectiveness of all three modules of MIQS [3], and further the effectiveness and usability of associations, multi-faceted similarity, and multi-scale similarity in the eBook retrieval task.
We implement three retrieval systems. The first matches the query against a database of all subject fields in the MARC entries; we call it Subject. The second matches the query by full-text search and is named Fulltext; we use the widely used search engine toolkit Lemur to implement it, with BM25 as the default term weighting and an inverted index. These first two systems are baselines. Our target system matches the query against keywords extracted from the chapters as described in [8] and is named Chapter. Chapter returns results as chapters rather than books.
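For reference, the BM25 term weight used by the Fulltext baseline has the standard Okapi form; the sketch below uses typical parameter values, which may differ from Lemur's actual defaults.

import math

def bm25_weight(tf, df, doc_len, avg_doc_len, num_docs, k1=1.2, b=0.75):
    """Okapi BM25 weight of a term in a document.
    tf: term frequency in the document; df: number of documents containing the term."""
    idf = math.log((num_docs - df + 0.5) / (df + 0.5) + 1.0)   # +1 keeps the weight non-negative
    tf_norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
    return idf * tf_norm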
Data Generation. We select 544 books from our eBook archive, covering nearly all available fields: agriculture, arts, economy, engineering, history, mathematics, management, and so on. We then extract keyphrases from each chapter of these books using SW [8]. In this experiment, we use the full text, chapter information, and metadata of each book.
Evaluation Measure. To evaluate the effectiveness of an IR system, precision and recall are usually used. However, in eBook retrieval it is very tiresome to skim through all the books to determine how many of them are related to a query. Moreover, most users won't be patient enough to browse through all returned results. As a result, some studies choose top-n precision, denoted p@n, as the evaluation measure: p@n evaluates how many of the top n results are relevant to the query. We use a variation of p@n, denoted s@n, where s is the graded relevance score of the association between a query and a book or chapter; this is more accurate than precision, where the score is only binary.
As the criterion of relevance, we carry out a double-blind user survey. We invite users to input any query word they like and score the top 10 returned results; they are not aware of the technical background of each IR system or of which one or two are baselines. Due to practical limitations, six users participated in this experiment. Since too many options would bewilder users, we constrain the scores to the range [0, 10]. As psychology studies suggest, users are prone to choose the middle score 5, so we choose the discrete relevance scores 2, 4, 6, 8, and 10. The meaning of each score is listed below. Users can make their decisions based on the metadata, table of contents, and full text of each book or chapter. For Chapter, users are instructed to score the book of the returned chapter first and then the chapter; otherwise, users would easily over-score the chapters because of the influence of the book.
Table 4. Definition of optional Relevance Score and mapping between two kinds of scores.
User score | 2 | 4 | 6 | 8 | 10
Definition | No relevance | A little relevant | Moderately relevant | Quite relevant | Very relevant
We concentrate on three measures.
Micro average s@10 (Mic s@10). Usually one page displays 10 results, so we investigate only the scores on the first page. Mic s@10 is the average score of the top ten results for each query; it shows the relevance score of each query separately.
Number of results @10 (NoR@10). This is the number of results returned for each query (at most 10), playing a role similar to recall. NoR is sometimes smaller than 10.
Macro average s@n (Mac s@n). This is the average score of the top n results over all queries, i.e., an inter-query relevance score at a certain ranking position (a computational sketch of these measures follows).
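A minimal sketch of these measures, under the assumptions that s@n averages the graded scores of the top-n results of one query and that Mac s@n averages s@n over all queries that returned results:

def s_at_n(scores, n=10):
    """Graded variant of p@n: average relevance score of the top-n results
    for one query (fewer than n results are averaged as returned)."""
    top = scores[:n]
    return sum(top) / len(top) if top else 0.0

def micro_s_at_10(per_query_scores):
    """Mic s@10: one s@10 value per query (per_query_scores maps query -> ranked score list)."""
    return {q: s_at_n(scores, 10) for q, scores in per_query_scores.items()}

def nor_at_10(per_query_scores):
    """NoR@10: number of results returned for each query, at most 10."""
    return {q: min(len(scores), 10) for q, scores in per_query_scores.items()}

def macro_s_at_n(per_query_scores, n):
    """Mac s@n: s@n averaged over all queries that returned at least one result."""
    values = [s_at_n(scores, n) for scores in per_query_scores.values() if scores]
    return sum(values) / len(values) if values else 0.0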
In addition, when several users choose the same query, we can investigate the scoring variance across users.
Result 1: micro measures. The results of the micro measures are shown in the table below. In this table, several abbreviations are used: s is short for Mic s@10; s(c) and s(b) stand for Mic s@10 over the chapters and books returned by Chapter, respectively; N is short for NoR@10. The figures in bold are the top values in the row. Queries marked with an asterisk are duplicated queries. The N/A of Fulltext for "China" might result from a failure in case recognition by Lemur.
Table 5. Micro measures @10 for each query.
Query | Fulltext s | Fulltext N | Chapter s(c) | Chapter s(b) | Chapter N | Subject s | Subject N
control | 3.0±1.9 | 10 | 7.2±2.7 | 5.6±2.3 | 10 | 7±2 | 10
mathematics | 4.8±3.4 | 10 | N/A | N/A | 0 | 6±3 | 9
beauty | 4.8±2.5 | 10 | 5.2±2.1 | 4.6±1.6 | 10 | N/A | 0
education* | 6.3±2.7 | 10 | 8.2±1.8 | 6.3±3.0 | 10 | 8±2 | 4
multimedia | N/A | 0 | N/A | N/A | 0 | N/A | 0
protocol | N/A | 0 | N/A | N/A | 0 | N/A | 0
China* | N/A | 0 | 6.0±2.1 | 4.4±0.8 | 5 | 10 | 1
culture | 5.8±2.7 | 10 | 5.4±3.3 | 4.2±2.6 | 10 | 3±1 | 3
war* | 2.0±0 | 10 | 7.7±2.5 | 6.0±2.9 | 10 | 7±3 | 10
health | 5.2±3.3 | 10 | 5.8±3.7 | 6.4±3.6 | 10 | 2 | 1
depression | 5.0±1.4 | 10 | 8.0±2.0 | 5.3±1.2 | 3 | N/A | 0
vitamin | 4.6±2.5 | 10 | 6.8±2.9 | 5.8±2.4 | 10 | 4 | 1
population | 5.2±3.6 | 10 | 7.0±3.7 | 6.0±3.7 | 10 | 6±6 | 2
symphony | 4.4±2.8 | 10 | 8.0±0 | 8.0±0 | 3 | N/A | 0
sculpture | 7.0±3.3 | 10 | 10.0±0 | 10.0±0 | 10 | 10 | 1
(Scores are shown as mean ± deviation over the returned results.)
Regarding accuracy, in theory Subject should have the highest scores, given that the assigned subjects match user needs; however, the result is that s(c) of Chapter outperforms the others in most cases (9 of 15 queries). There might be several causes. First, since both the decisions made by the subject assigners (librarians) and by the users are subjective, the scarcity of books returned by Subject makes its score more unstable, which is a possible reason for the fluctuating performance of Subject. Second, a keyword usually has different meanings in different contexts, and mismatches of context-dependent meaning bring low scores. Finally, one highly related book usually contains several highly related chapters; when more than one of these chapters is returned by Chapter, it receives a higher Mic s@10. Moreover, a chapter locates relevant content more precisely than a whole book. In short, Chapter has advantages both in the accuracy of its similarity scores and in directing users to the text they want, with no need to skim through an entire book.
Second, Chapter has a NoR comparable to Fulltext's, which is significantly higher than Subject's. Empirically, Fulltext should have the highest recall (NoR), since it has the largest word set. Combined with its accuracy, Chapter thus returns more results with higher relevance scores.
Note that there are some books returned only by Chapter. Although its s(b) is low, its s(c) is high, since some highly related chapters come from seemingly unrelated books, and these chapters would not be discovered by the other two baselines.
But Chapter still has shortcomings on some general words such as "mathematics". The reason lies in the mismatch between stemming and de-stemming: "mathematic" and "mathematical" appear as keywords in the chapters, but our current stemmer and de-stemmer fail to map them to a common stem. There is room for improvement here by modifying the Porter stemmer and our de-stemmer.
Result 2: macro measures. In the figure below, we can see that s(c) outperforms the others at every n. The variances of Mac s@n for Fulltext, s(c), s(b), and Subject lie in the ranges [2.87, 3.37], [2.33, 2.81], [2.62, 2.81], and [3.04, 3.60], respectively, so s(c) has a relatively low variance. The trends of the curves differ: Fulltext drops drastically; the two curves of Chapter dip slightly and then stay relatively steady; Subject fluctuates with an increasing trend. The reason is the different ranking schemes of the three systems: Fulltext returns the documents with the highest BM25 scores, Chapter ranks the results by the SW weights of chapters, and Subject imposes no significant ranking difference among results of the same matching type (topic-term match or sub-property match, full match or partial match).
Figure 7. Macro average s@n for all three systems.
Result 3: user variance. Thanks to the replicated queries (each selected by two users), we can further study how different people vary on the same query and the same books/chapters. Given the name of the system, a query, and the returned books/chapters, the top 10 results are considered. We treat the relevance scores of these 10 results (though sometimes fewer than 10) as a score vector, and measure the agreement between two users as the cosine of their two vectors (a sketch of this computation follows Table 6). The result is surprising: as seen in the table below, the cosine values are all very close to 1, meaning the two score vectors point in nearly the same direction, i.e., the users' judgments are highly consistent. To some extent, this suggests that Results 1 and 2 have strong generalizability. The conclusion might be that users with similar educational backgrounds tend to make similar relevance judgments.
Table 6. Relevance score variance between users.
Query | Fulltext | Chapter s(c) | Chapter s(b) | Subject
education | 0.854 | 0.938 | 0.931 | 0.990
China | N/A | 0.963 | 0.980 | 1.000
war | 1.000 | 0.930 | 0.929 | 0.990
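The user-consistency measure referenced above is a plain cosine between two users' score vectors over the same ranked list; a minimal sketch follows, with values near 1 indicating nearly proportional (highly consistent) judgments.

import math

def score_vector_cosine(scores_user1, scores_user2):
    """Cosine between two users' relevance-score vectors for the same ranked results."""
    dot = sum(a * b for a, b in zip(scores_user1, scores_user2))
    n1 = math.sqrt(sum(a * a for a in scores_user1))
    n2 = math.sqrt(sum(b * b for b in scores_user2))
    return dot / (n1 * n2) if n1 and n2 else 0.0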
Our thanks to NSFC (No. 60605020) and the 863 Project of China (No. 2006AA01Z320 and No. 2006AA010105) for funding, and to the anonymous reviewers, Yuanning Li, Junhua Zhao, Dou Shen and Shenghua Bao for their comments.
[1] Hodge, G. Systems of knowledge organization for digital libraries. Digital Library Federation, USA, 2000.
[2] Huang, C., Tian, Y., Zhou, Z., Ling, C., Huang, T. Keyphrase extraction using semantic networks structure analysis. In Proc. of ICDM'06, pp. 275-284, Hong Kong, 2006.
[3] MIQS: A multi-granularity multi-faceted E-book Search Engine. http://159.226.42.40/IQuery
Copyright is held by the author/owner(s). WWW 2007, May 8–12, 2007, Banff, Alberta, Canada. ACM 978-1-59593-654-7/07/0005.