With the rapid development of new technologies, both ordinary users and service providers are experiencing the coming wave of the next-generation Web. As a representative, tagging based websites, such as Del.icio.us1 and Flicker2 , have achieved a significant success. Their low technical barriers and the easy use of annotations have attracted lots of Web users. Millions of annotations are freely and openly assigned to the digital items like web pages, photo images and blog posts. Now, annotation is not only a method for organizing contents to facilitate the users who create it, but also a navigation mechanism for users to discover interesting resources. It has become a new interface of Web and has drawn much attention from both research and industrial communities.
Currently, there are two main methods of helping users to seek the information through annotations. The first one is the keyword-based search, which is the most common way for finding information on the Web. Systems of this type will display all contents associated with the given annotation. The second one is a method called tag cloud view [20]. It usually displays the social annotations alphabetically with different font sizes and colors indicating their popularities. Selecting a specific annotation will generally lead to the keyword search with the selected annotation as input. Compared with the direct searching method, tag cloud provides a better user interface for browsing the popular social annotations. However, the drawbacks of the methods are obvious, especially when the scale of social annotations is quite large:
In this paper, we consider the problem of browsing large scale social annotation data. An effective algorithm, Effective Large Scale Annotation Browser (ELSABer) is proposed. Compared with the previous method, it has the following advantages:
Given the personal information and the time-related restrictions, ELSABer can be easily enhanced. By incorporating users' profiles, ELSABer can display the social annotations and contents according to users' preferences. By further incorporating the time restrictions, ELSABer can display the popular annotations and contents within a specific time interval. These two extensions are useful in helping the user to discover the personal interested resources and the popular resources.
A prototype system is implemented based on ELSABer. Figure 1 gives a snapshot of the system. The page behind is the initial interface of the system. It contains popular annotations distributed in different clusters. The size of each annotation indicates its popularity. The page in the front is the result after a user selects the annotation "programming". On the right side is a set of pages related to the current annotation. Each line on the left side is a sub-category of the current annotation, which consists of several related annotations. Users can click the tag on the left side to further investigate that category. In this paper, we use the terms "annotation" and "tag" interchangeably.
The rest of the paper is organized as follows. Section 2 briefly reviews the studies of social annotation. Section 3 gives an overview of our algorithm. Section 4 describes each component in our social browsing algorithm. Section 5 discusses how to extend the algorithm in two ways. Section 6 presents the experimental results. Finally we make a conclusion in Section 7.
Recently, there are lots of studies on social annotations, including blog posts, interesting demos and academic research papers.
The early discussion of the social annotation can be found in [1, 3, 4, 5, 11]. They initiated the idea that sharing tags can lead to the concept known as "folksonomy". The term first appeared in an information architecture mailing list [12]. Quintarelli [5] suggested that we should take social annotation as an information organizing tool. In [1], Golder et al. gave the specific analysis of the social annotation data in Del.icio.us in both the static and dynamic aspects. In [3], the authors gave a good review of available social bookmark tools.
Research based on social annotation has been done in various areas such as semantic web [6], social network [9], and enterprise search [8]. In [9] Mika proposed a bipartite model of ontology with a social dimension and found that the semantic relationships among tags are based on their co-occurrence with users or resources. In [6], Wu et al. used a probabilistic generative model to obtain the emergent semantics hidden behind the co-occurrence of three types of data (tags, users and resources). They also proposed a framework for semantic search based on their emergent semantic model. In [8], Dmitriev suggested that folksonomies were not limited to the blog sphere but also benefited enterprise search. An annotation tool was implemented within an enterprise environment and improved the search efficiency. In [7], the authors analyzed the effectiveness of tags for classifying blog entries and argued that there is a topical hierarchy among tags. However, the hierarchy, which is a static rigid binary tree lack of semantic control, is not suitable for the social browsing problem.
All the above research is different from ours. Their work is about discovering and utilizing the features of social annotations instead of focusing on how to browse annotations themselves.
There are a few studies on visualizing and browsing the social annotations. Dubinko [10] proposed and solved the problem of visualizing the evolution of tags within Flickr online image sharing service. They gave an efficient algorithm for processing the large data in real time. Their work focuses on discovering the hot images and tags in a pre-defined time interval. It is not a proper solution for users to browse all annotations. In [21], Begelman applied the clustering algorithm on the social annotations to improve users' browsing experience. Their algorithm can not handle the synonymy and ambiguity problems. Our work is different from theirs. We proposed the browsing framework with three features, including the solution for a semantic browsing.
Some demos for visualizing tags are also available on Web. Grafolicious [22] produces graphs illustrating when and how many times a URL has been bookmarked in Del.icio.us. HubLog [23] gives a graph of related tags connected with the given tags. Although these demos gave a vivid picture of social annotations in different aspects, their goals are not to help users to browse annotations effectively.
In this section, we give an overview of the ELSABer algorithm as shown in Algorithm 1. The algorithm is generally designed for any social annotation environment, e.g., Del.icio.us, Flickr. In this paper, we use Del.icio.us annotations for analysis and evaluation.
From step 1-1 to 1-2, the algorithm initializes the first view of annotations. NT, NU, NC, and NCT denote the number of tags, URLs, clusters and tags in each cluster. In our experiment, these parameters are set to 2000, 2000, 20, and 5, respectively, which means the top 100 tags distributed in 20 clusters on 2000 most frequent tags and URLs are presented to the users as the default browsing interface. These popular tags, which are associated with a large number of resources, are selected as the roots in hierarchical browsing. When users select a tag as the entrance to annotation browsing, the algorithm outputs its related resources and a set of annotations as sub-tags. Users can iteratively select any annotation from the displayed sub-tags for further exploration. The iterative process consists of four components as follows:
Then the user can click one of these presented sub-tags to further seek his desired resources.
In this section, we will give a detailed description of each component in ELSABer. Before the discussion, we first give a formal representation of annotation data.
Del.icio.us provides a popular tool for organizing bookmarks. A description online [2] states it as:
"A social bookmark manager. It allows you to easily add sites you like to your personal collection of links, to categorize those sites with keywords, and to share your collection not only between your own browsers and machines, but also with others".
In Del.icio.us, an annotation activity typically consists of four elements: an annotator, a URL, a tag, and a tagging time. We define an annotation as a quadruple:
We disregard the roles of User and Time, and view the annotation data as points in a high dimensional space called the tagging space. The annotation data can be represented as an m×n matrix C, where m and n is the total number of tags and URLs, respectively. Let Cij denote the number of users who annotate the jth URL with the ith tag. Let M bethe m×n association matrix and Mij denote the association degree between the ith tag and the jth URL. A simple method is to let Mij equal to Cij.In our experiment, we borrow the idea of TFIDF from the IR field and calculate the association weight as follows:
where |URL(ti)| represents the number of URLs annotated by ti.
Given the matrix M, the tag can be represented as a row vector Ti (U1,U2,.. Un) of M. Similarly, the URL can be represented as a column vector Ui (t1,t2,...,tm) of M.
People annotate web pages mainly for organizing web pages with different contents, so annotations are usually abstracts of the contents of these web pages. Abstracts of the same web page are usually similar. Below, we give the first observation of social annotations:
Observation 1: similar tags will annotate similar URLs and similar URLs will be annotated by similar tags in the social annotation environment.
Based on observation 1, the semantics of a tag can be reflected by resources which it tagged. The semantic relation is derived from their co-occurrences. As shown in Figure 2, T1 and T2 are similar tags since they share similar URLs U3 and U4. T1 and T3 are less relative. Figure 2 also illustrates that the similar tags also annotate the different URLs.
For measuring the semantic relationship between tags, we propose a symmetric measurement as follows:
where Ti and Tj are tag vectors corresponding to tags ti and tj, respectively. The tag vector Ti is determined by the URLs which are annotated by the tag. So it may vary according the change of the related URLs.
Some linguistic features are also used for calculating Sim(ti, tj). When tags are freely assigned to the relative URL, tags are used in various forms, such as the plural form and gerundial form. For example, "Programs", "Programming" and "Program" all exist in the annotation data. Additional weight is added to Sim(ti,tj) if two terms share the same etyma after porter stemming. Besides, if the two terms share the etyma after eliminating the external punctuations, a lighter additional weight is added to the Sim(ti,tj) score. The two weights are set to 0.1 and 0.08, respectively.
In the social browsing setting, the user tries to find his desired resources by selecting the tag with the closest meaning to his intended information. Therefore, when the user selects "film", those pages tagged by "movies" are also of his interest. In order to provide the user with the complete resources of his interest, we find tags and URLs related to the semantic concept of a tag instead of finding tags and URLs by matching the tag literally. Following, we give the method of generating semantic concepts.
Given the selected tag ti, we choose a set of tags most related to ti, as the synonymic tagset STi ={ tj | tj is similar to ti }. In this paper, the candidate tags STi is generated using the following rules:
where N and θ in the rules are set to 4 and 0.7, respectively. Then, a semantic concept Ci for ti is represented by the following set:
Note that, once a user has clicked L times and forms a click trace of t1, t2,...,tL, we have a sequence of L concepts: C1, C2,...,CL. Let SC= {C1, C2,..., CL}. The related URLs in step L can be defined as follows:
where u is a candidate URL and T(u) means the set of annotations given to URL u.
Given a set of related URLs, the related tags can be defined as all the tags given to ReURL(SC).
Figure 3 illustrates the idea above. The concept set contains two concepts: one consists of "WebDev" and "Web-dev" while the other contains tag "JSP" only. "WebDev" and "Web-Dev" are highly related tags which satisfy the above rules. U1, U3 are the URLs matching the current concept set since U1 is annotated by "WebDev" and "JSP"while U2 is annotated by "Web-Dev" and "JSP". Related tags like "CSS", "AJAX", "2.0" are obtained from the matched URLs. The sub tagging space is formed by these related tags and URLs.
Note that, by tracking the user's selections, the problem of tag ambiguity can be solved, because previous selected concepts play as domain limitations, which can disambiguate meanings of tags in different domain. For example, U4 will not be selected, since it does not match the limitation "Web-Dev".
Quintarelli [5] and Mathes [4] both argue that the tagging space is a flat space and a hierarchical representation of topics does not reflect the associative nature of social annotations. But in [1], Golder states that the different expertise and purpose of tagging participants may result in tags at various levels of abstraction to describe a resource. For example, a photo can be tagged at the "basic level" by "cat", at a super ordinate level by "animal" or at various subordinate levels below the basic level by "Persian cat" or "Felis silvestris cats longhair Persian". Our observation is that:
Observation 2: there is not a neat tree structure like taxonomy or human built ontology with rigid hierarchies and pre-defined categories with clear boundaries for social tags, but the tags used in social annotation do locate in different semantic levels in the social annotation space.
By our observation there are many combined words like "programming/java" and "Design/CSS", which may reflect the needs for hierarchical annotation in Del.icio.us. Several single word tags which are used to annotate the URL by the same user also reflect that the hierarchy exists in the social annotation. For example, there are URLs tagged by "java, jdbc" and "music, jazz". So it's feasible to explore the hierarchical structure of social annotations for hierarchical browsing. The structure has the following features:
Figure 4 illustrates the features described above. There are two hierarchical structures, rooted "programming" and "Design", respectively. The annotations like "WebDev" and "JSP" are shared in both trees. Users can reach the "jsp" tag from either "programming" or "Design".
For each selected tag, a set of related tags are obtained from previous steps. Obviously, not all the tags are proper to be the child node of the current tag to expand the tree structure. Related tags are mainly of the following types:
where U(ti) denotes the number of URLs tagged with ti. Figure 5 illustrates the capacity of this feature. Here, the current tag is "Google" and the Coverage values of 40 of its related tags are calculated and compared. The types of these tags are manually labeled. The figure shows that tags at a super ordinate level such as "Web" and "computer" have much higher Coverage values than other tags.
It is the ratio of the number of ti and tj's common URLs to the number of ti's URLs. If the intersection URL set is the main part of all the URLs of ti, but a small part of tj, we can infer that ti is a sub-tag of tj. For example, most URLs tagged by "gmaps" are also tagged by "Google", but a small number of URLs tagged by "Google" are tagged by "gmaps", so we infer that "gmaps" is a sub-tag of "Google".
Note that we should pay attention to the size of U(ti), if U(ti) is as small as 1 or 2, the IRij above will give the tag ti a high score and take it as a sub-tag with high confidence. But the tag is likely to be meaningless tag or noisy tag. We use equation 8 for solving this problem; we set a threshold on the size of U(ti),. We also introduce a parameter λ for smoothing the results. In the experiment, both the threshold and λ are set to 5.
Figure 6 shows the capacity of this feature. The current tag and the data set are same as those in Figure 5. We can see from Figure 6 that tags at subordinate levels like "gmaps" and "earth" have much higher IR values than other tags.
Given the features above, each related tag is represented as a feature vector. A decision tree can be derived from the manually labeled data set to predict the sub-tag relations using C4.5. Figure 7 shows a snippet of the rules we got.
For categorizing the sub-annotations generated in the last step, we present a clustering algorithm, which successfully solve the following problems:
In the light of the above discussion, traditional clustering algorithms like K-means [14], which purely rely on a predefined cluster number is not proper for our problem. Clustering algorithms based on graph partition [13] will give an optimal partition of the graph, but complexity of these algorithms prohibits them to be applied in a real-time browsing problem. Here we propose a dynamic clustering algorithm for the social browsing problem, as shown in Algorithm 2. It may not be the optimal solution according to the graph theory, but it is a proper solution for social browsing. The label of each cluster is also generated during the clustering process.
The algorithm first introduces an informative rank over the candidate sub-tags based on the following tag properties:
Tag Frequency/ Inverted URL Frequency: This property indicates the annotation's importance and is defined in the same way as Equation (1).
Intra-Cluster Similarity: This property, namely ICS was used to measure whether a tag is a good representation of a single topic.
Where ot denotes the centroid of all the URLs associated with the tag and each URL in this tagging space can be represented as a vector ui=(t1,t2,...tn) (see section 4.1):
where U(t) represents the number of URLs associated with tag t.
Tag Entropy: This property, denoted by TE, is used to calculate the distinctness of a tag [15]. A tag which seldom shares URLs with other tags is more likely to be a cluster.
Finally, the informative score for tag t is defined as the linear combination of all the above properties:
We decide the weights w1, w2 and w3 by using a linear regression model over the manually labeled data set. In our experiment, these weights are 0.58, 0.27, and 0.13, respectively.
After obtaining each tag's informative score, we select the most informative tag as the label of the first cluster, and find its similar tags using Equation 2 and its related URLs by calculating the cosine similarity between these URLs and the centroid of all the similar tags. Then we remove its similar tags and related URLs from our dataset. This process terminates when no remaining tag has enough number of related URLs.
Nowadays the number of tags and URLs are increasing exponentially with the development of Web 2.0 and extensive application of tagging services. Therefore, the efficiency of our algorithm will be influenced. In this section, we will discuss how to accelerate our algorithm. First we give the 3rd observation based on our analysis on Del.icio.us.
Observation 3: Popular tags and URLs play an important role in our social annotation data. People use popular tags to annotate URLs and also the popular URLs are annotated by the majority of tags.
Figure 8 and Figure 9 demonstrate Observation 3 illustratively. X axis represents tags in the order of their counts and Y axis represents the counts of the tags. Figure 8 illustrates the distribution of the counts of tags which are associated with a certain URL. We discover that people always use most popular tags to annotate the URL and unpopular tags are barely applied to annotate. Figure 9 demonstrates the distribution of the counts of tags associated with the whole Del.icio.us data. We find out that the popular tags are frequently and extensively used in the whole Del.icio.us data although there are thousands of tags used.
The responding time of our browsing algorithm is the key to the users' experience. For the sake of efficiency, we borrow the idea of the inverted index from the IR area to index both the tag vector and the URL vector. However, direct application of this indexing scheme would still be inefficient because the tagging space has billons of annotations. To overcome this difficulty, we introduce a sampling method based on Observation 3 to limit the time complexity of our algorithm to a proper scale in spite of the huge size of social annotation data.
According to Observation 3, we discover that the content of a URL can be reflected by the most popular tags and also the tag semantics can be decided by the most popular URLs. So we can get good results efficiently by running our algorithm in a small sub tagging space consisting of tags and URLs that are most frequently used and annotated, i.e., sampling K most frequently annotated URLs and K most frequently used tags from the dataset to form a sub tagging space for the algorithm. In our experiment, we set K to 2000, so the size of M is 2000 × 2000.
Note that we do not cut off the "long tail" of tagging space, although we use the sampling method. In our algorithm, the more important a tag is, the earlier it will emerge. Based on the discussion above, these important tags cover the majority of URLs, thus should be located at higher semantic level and presented to the user earlier. After the user click one of these tags, related URLs associated with this tag will be discovered, which will bring related tags including both popular and unpopular ones. Therefore we do not lose the connection with the "long tail". After a sequence of click by the user, the intention of the user will be more specific, this causes a decreasing number of related URLs or related tags. When the number is less than 2000, all the tags and URLs will be calculated, which means that the "long tail" is covered and the sampling method is not applied. Figure 11 shows the URL coverage of popular tags in Del.icio.us. Axis X is the ith popular tag, and Axis Y is the URL coverage percentage. It shows 84% of URLs in our data set associate with the top 2000 tags.
Different people have different browsing preferences, since users have different interests. The personalized browsing provides the user with annotations more closely match his interests by utilizing user preferences. People are also interested in recent hot topics. The time-related browsing can discover annotations according to their popularity within a specific time interval. In this section, we will show that our browsing framework can be easily extended to fit the requirement of personalized and time-related browsing.
Personalized browsing has been well studied in browsing interface [16], personalized website browsing [17], personalized webpage recommendation [18], etc. Here, we are to provide the personalized social annotation browsing.
In previous personalized systems, additional effort is usually required to build a user profile which is generally time consuming, and the generated profiles are sometimes out of date. In our social annotation environment, the profile of a user can be directly obtained from Del.icio.us and dynamically modified according to the changes of the user's interests over time. Assuming that User Ua is a registered user of Del.icio.us, his profile is represented as a set of triples:
Given the profile P(Ua), the social annotations can be classified into three categories as shown in Figure 11:
The user interested annotations and resources can be found as follows:
where UI(Ri|P(U)) denotes the degree of interest between user U and resource Ri, while UI(Ti|(P(U)) denotes the degree of interest between user U and annotation Ti. Ri denotes the vector representation of a resource, and Ti denotes the vector representation of Ai.
Given this quantitative evaluation of user interests of each annotation and the resource, we extend the basic social browsing model to the personalized model as follows:
As described in [19], the current Web is a sensor of the real world in some sense. In most cases, users are often interested in browsing the popular web pages in recent time. Thus providing time-related browsing would be helpful to most users. Recall that the annotation data in Del.icio.us can be represented as a set of quadruples:
Where "Time" is the time when "User" tagged "URL" using tag "Tag". Due to the fact that different users may annotate the same URL at different times, the times related to a specific URL form a time sequence TS [t1, t2,... tn]. Given the user required time interval TI= [ts, te]. We define the match of the URL's time sequence TS and the user required time interval TI as follows:
Where te-ts denote the number of URLs annotated in this time interval. Then we extend the basic model to the time-related social browsing model by applying the social browsing algorithm only on the matched tags and their associated URLs. Hereθis set to 50%.
In this section, we give the evaluation of the proposed algorithm. All the discussions below are based on a data set collected from Del.icio.us during May, 2006, which consists of 1,736,268 web pages and 269,566 different annotations.
For evaluating the effectiveness of our similarity measurement, we give several results of constructing a concept for a given tag. As shown in Table 1, the concept of a given tag consists of several highly related tags that are generally synonymous tags, abbreviations or plurality of the given tag. The semantic of the current tag is influenced by the user's previous selections, e.g., "movie" in the "programming" area may mean "screenshot" or "screen capture" instead of "film" or "moving picture" in daily life; the concept of "Brainstorming" in the programming area are more likely to be "mindmap" and "freemind", which are two popular software used for brainstorming. Results show that our similarity measurement can correctly reflect relations between tags in the social annotation environment.
CVS : Versioncontrol, SVN, subversion, control |
Movie: Movies, Film ,Films |
Computer/Gallery: album photogallery fotos |
Programming/Meta: Metaprgramming |
Programming/Movie: screenshot screencapture |
Programming/Brainstorming: mindmap freemind |
Table 2 shows the sub-tags discovered for 20 selected concepts. These concepts are selected randomly from 100 popular topics of Del.icio.us by a group of students in our lab. The concepts are distributed in different areas. Since a lot of annotations in Del.icio.us are related with IT, more selected topics are related to IT. In each box, the first line gives the concept label. For each of the 4 concepts in the first line, we listed only ten subordinate concepts and for each of the rest of concepts, we listed only 5 subordinate concepts due to space limitation. Table 2 implies that our algorithm is able to organize a hierarchical structure of tags as people think in their everyday life. For example, when the user clicks science, our algorithm is able to generate a series of sub categories such as math, physics, psychology, etc., which are mostly meaningful and very distinguishable, and also illustrate the topic most efficiently according to the knowledge of people.
We evaluate the efficiency of our system with a modest machine (Intel Pentium IV 3.0 GHz, 1GB memory, 2 processors). The system is implemented in the java language. Lucene API is also used to build URL and Tag index. We cached the matrix of top 2000 URLs and 2000 tags for reducing the time cost of database accesses. The average time of processing 20 concepts is 1.3 sec.
Figure 12 shows the results of the algorithm with the consideration of the user's profile. The tags in red are owned by the user and the tags in orange are recommended tags for the user. In the experiment, the user's profile consists of 25 tags and 45 URLs and the top 5 user interested tags are "Linux", "media", "video", "JavaScript" and "Java", so the user is likely to be a Web developer or a fan of "media". We see that the recommended tags in Figure 12, such as "ajax", "videos" and "download", are highly related to the user's interests, which implies that the personalized browsing can indeed help users to find their interested resources effectively.
Figure 13 demonstrates the distribution of the tag counts associated with three URLs over time. It can be seen that given a certain URL, the popularity varies as time goes by. Our algorithm can discover the newly emerging resources such as URL 1 and URL 3 which increased to the peak at the beginning. It can also detect the resources which become the hot topics periodically like URL 2. At the same time, we found out that URL 3 was just created in Dec.21 2005 and became the hot topic in the following day, which means people can get the most popular topics by browsing the social annotation with time information.
Date: 2005-12-12 http://www.nist.gov/dads/terms.html |
Date 2006-2-28 http://developer.apple.com/tools/rubyonrails.html |
Date2005-12-22 http://www.exploding-boy.com/2005/12/21/more-free-css-navigation-menu-designs/ |
Programming | Music | Science | Microsoft |
AJAX JavaScripts Ruby rails PHP Python Java Framework C Cpp Dhtml Lisp Perl |
Bittorrent Torrents Ipod Radio MP3 Itunes Guitar Chords Sound Soundfx Player Songbird Indie Drm Lyrics song |
Health Sleep Math Mathematics Physics Quantum Psychology Brain Space Astronomy Algorithms MIT Biology Lectures sicp Evolution Creationism |
XP Tweaks Excel Word Writely asp.net dotnet XBOX MSN WindowsLive Outlook Boot Bootdisk Spyware Vista Longhorn |
Arts | Basketball | Book | Computer |
Graffiti Streetart Museum mus Knots topology Poetry Anvatagrade Artistis Painter |
ESPN Fox Autism Dallas NBA |
Lisp Literature ebook Audiobooks Amzaon Scheme Sicp |
Developers IE favorites Algorithms comupeterscience Spyware Adware |
C | Design | Game | |
Algorithm DataStructure Cocoa Objective Mono Compiler Compilers Visualstudio VS2005 |
CSS Webdesign Flash Art Graphics Fonts Typography Photoshop |
Maps Googlemaps Gmail GreasemonkeyUserscripts Searchengines; GPS geocaching |
Sudoku Puzzle Warcraft worldofwarcraft Videogames Chess Emulation emulators |
Java | Mac | Reference | Web |
Eclipse IDE Framework Xmlhttprequest J2EE Spring UML |
Ipod Itunes Macosx OSX Cocoa Objective Quicksilver |
CSS HTML Howto Tips Ebook ebooks Maps Language Dictionary |
Css Webdesign Ajax Javascript PHP Del.icio.us Delicious |
Social annotation browsing is a recently emerging task and becomes more and more important as the annotations of web resources keep on increasing at a surprising speed. In this paper, we analyze the characteristics of social annotations in three aspects, namely similarity, hierarchy and distribution. Based on observations in these aspects, we propose the ELSABer algorithm for effective social annotation browsing. A prototype system is also implemented based on ELSABer and produces encouraging results. Our main contributions can be concluded as follows:
In the future, we will conduct more user studies for evaluating the effectiveness of our algorithm since browsing problem need more consideration in the view of user. Further more, we should emphasize on how to find more qualified URL resources and utilize existing hierarchical structures such as ODP and WordNet for helping construct more meaningful hierarchical structures for social annotations.
The authors would like to thank IBM China Research Lab for its continuous support to and cooperation with Shanghai JiaoTong University. The authors also appreciate the valuable suggestions of Linyun Fu, Shengliang Xu and Lei Zhang. In the end, the authors would like to thank the three anonymous reviewers for their elaborate and helpful comments.
[1] S. A. Golder, and B. A. Huberman. Usage Patterns of Collaborative Tagging Systems. Journal of Information Science, 32(2), pp.198-208, 2006
[2] J. Schachter. Del.icio.us about page. http://del.icio.us/doc/about, 2004
[3] T. Hammond, T. Hannay, B. Lund, and J. Scott. Social book marking tools (i) - a general review. D-Lib Magazine, 11(4), 2005.
[4] A. Mathes. Folksonomies - Cooperative Classification and Communication through Shared Metadata. http://www.adammathes.com/academic/computer-mediated-communication/folksonomies.html, December 2004
[5] E. Quintarelli. Folksonomies: power to the people. Paper presented at the ISKO Italy-UniMIB meeting. http://www.iskoi.org/doc/folksonomies.htm, June 2005
[6] X. Wu, L. Zhang, and Y. Yu. Exploring Social Annotations for the Semantic Web. In: Proc. of WWW 2006, pp. 417-426, May 23.26, 2006
[7] C. H. Brooks, N. Montanez. Improved annotation of the blogosphere via autotagging and hierarchical clustering. In: Proc. of WWW 2006, pp. 625--632. May 23.26 2006
[8] P. A. Dmitriev, N. Eiron, M. Fontoura, and E. Shekita. Using Annotations in Enterprise Search. In: Proc. of WWW 2006 , pp. 811-817, May 23.26, 2006
[9] P. Mika Ontologies are us: a unified model of social networks and semantics. In: Proc. of ISWC 2005. pp. 522-536, Nov. 2005.
[10] M. Dubinko, R. Kumar, J. Magnani, J. Novak, P. Raghavan, A. Tomkins. Visualizing Tags over Time. In: Proc. of WWW 2006, pp. 193-202, May 23.26, 2006
[11] C. Shirky. Folksonomy. Blog entry at http://www.corante.com/many/archives/2004/08/25/folksonomy.php, August 2004
[12] G. Smith. Atomiq: Folksonomy: social classification. http://atomiq.org/archives/2004/08/folksonomy_social_classification.html, Aug 3, 2004
[13] A. Pothen, H. D. Simon, and K.-P. Liou. Partitioning sparse matrices with eigenvectors of graphs.In: SIAM J.Matrix Anal. Appl., 11(3):430{452, 1990.
[14] J. B. MacQueen. Some Methods for classification and Analysis of Multivariate Observations, In Proc.of 5th Berkeley Symposium on Mathematical Statistics and Probability, 1:281-297 1967
[15] H. -J. Zeng, Q.-C. He, Z. Chen, W.-Y. Ma and J. Ma. Learning to cluster Web search results. In Proc. of SIGIR 2004, 2004
[16] http://www.freepatentsonline.com/5961593.html
[17] B. Christos, K. Vaggelis, M. Ioannis. A Web-page fragmentation technique for personalized browsing.In: Proc. of the symposium on Applied computing pp. 1146 - 1147.
[18] http://www.stumbleupon.com/
[19] Q. Zhao, T.-Y. Liu, S.S. Bhowmick, and W.Y. Ma. Event Detection from Evolution of Click-through Data. In Proc. of SIGKDD 2006.pp.484-493
[20] Del.icio.us tag cloud view: http://del.icio.us/tag/
[21] G. Begelman, P. Keller and F.Smadja, Automated Tag Clustering Improved search and exploration in the tag space In: Proc. of Collaborative Web Tagging Workshop at WWW 2006.
[22] http://www.neuroticWeb.com/recursos/del.icio.us-graphs/
* Part of Rui Li and Shenghua Bao's work of this paper was conducted in IBM China Research Lab.