What do users value in web search engine result summaries? To investigate, we conducted a series of experiments in which surveyed users answered questions about summaries that had been deliberately manipulated to exhibit certain characteristics.
We use the term summary to refer to the title, abstract, and URL displayed for each web search result. Abstracts in web search summaries are usually query-biased [3]; that is, the abstract for a document depends on the query used to retrieve it.
We conducted a series of experiments in which online surveys were given to randomly selected Yahoo! Search users. In each survey, approximately 2500 users were randomly assigned to two or more experimental groups. Each user was presented with a hypothetical search situation and a query that might be used in that situation. Users were then shown the summary of a search result for that query and asked to answer questions about it.
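The random assignment step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `assign_groups`, the condition labels, the fixed seed, and the balanced round-robin scheme are all hypothetical, since the text says only that users were randomly assigned to two or more groups.

```python
import random
from collections import Counter

def assign_groups(user_ids, conditions, seed=0):
    """Randomly assign each user to one condition with near-equal group sizes.

    Assumption: balanced assignment. The source only states that users
    were randomly assigned to two or more experimental groups.
    """
    rng = random.Random(seed)          # fixed seed so the sketch is reproducible
    shuffled = list(user_ids)
    rng.shuffle(shuffled)              # random order removes any selection bias
    # Deal the shuffled users round-robin into the conditions.
    return {uid: conditions[i % len(conditions)] for i, uid in enumerate(shuffled)}

# Hypothetical survey: 2500 users and three made-up condition labels.
conditions = ["complete", "good_breakpoints", "bad_breakpoints"]
assignment = assign_groups(range(2500), conditions)
group_sizes = Counter(assignment.values())
print(group_sizes)
```

Shuffling first and then dealing round-robin keeps group sizes within one user of each other, which simplifies later comparisons between conditions; independent per-user random draws would be an equally valid design.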
Rather than using actual search engine summaries, we presented users with editorially written summaries that varied only by a particular attribute. Each time we manipulated one attribute, other attributes were kept constant. To construct the abstracts, editors typically began with a web page retrieved for a particular query, then manually identified snippets¹ to be included, modifying them to create the desired test condition. Editorial creation of abstracts enabled us to control which parts of the document were included and to ensure consistency. Titles were held constant across all conditions, and consisted of the queries themselves. Table 1 shows an example of the variant abstracts created for one dimension of abstract quality, the "choppiness" of the text.
Condition | Abstract |
---|---|
Complete sentences | Interested in a full-blown overview of the Honda Accord from its birth to the current version? ... Initially available only as a hatchback, the Accord rode a 93.7-inch wheelbase and sported a clean, uncluttered body style. |
Incomplete sentences, good breakpoints | ... a full-blown overview of the Honda Accord from its birth to the current ... available only as a hatchback, the Accord rode a 93.7-inch wheelbase, weighed about 2,000 pounds and sported a clean, uncluttered ... |
Incomplete sentences, bad breakpoints | ... blown overview of the Honda Accord from its birth to the current version? Then you'll ... to park. Initially available only in two-door hatchback form, the Accord rode a 93.7-inch wheelbase, weighed ... |
Twelve survey questions were grouped into three areas: nature of page/readability (e.g. "The information is presented in an uncluttered manner"), relevance ("Would you click on this result"), and trust ("How likely are you to trust the information on the resulting web page"). To keep the survey short, each participant saw only the questions from one of the areas. Responses to the relevance questions were Yes or No; all other questions were answered on an agreement scale of 1 to 5. Each participant answered questions for two scenarios. As a result, each question was answered by about 1000 users per condition on average. All scenarios described an informational search goal [2], and all queries were two words long.
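We ran six experiments, each modifying one of the following summary attributes and holding all others constant: text choppiness, snippet truncation, query term presence, query term density, abstract length, and genre.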
Table 2 shows an overview of the results. Each row refers to an experiment in which only the specified attribute was manipulated. An entry in the "Direct Effect" column indicates that there was a statistically significant difference (at 95% confidence) among the treatment conditions in user responses on the topic(s) that were directly related to the manipulation. For example, the first row shows that increasing the "choppiness" of the text led to different user responses on two survey questions: one asking about readability and one assessing clutter. An entry in the "Indirect Effect" column indicates that there was a difference in user response on an aspect of quality not immediately related to the manipulated attribute. The final column shows which condition was preferred.
Attribute | Direct Effect | Indirect Effect | Best Condition |
---|---|---|---|
Text Choppiness | readability, clutter | -- | complete sentences |
Snippet Truncation | -- | trust, understand why retrieved, perceived relevance | end removed |
Query Term Presence | -- | -- | n/a |
Query Term Density | -- | -- | n/a |
Abstract Length | -- | clutter | short |
Genre | conveys genre | trust, likely to have information | include genre cues |
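To make the "statistically significant at 95% confidence" criterion behind Table 2 concrete, the sketch below compares Yes/No responses between two conditions with a two-proportion z-test. The choice of test is an assumption, since the text does not say which test was used, and the response counts are invented purely for illustration.

```python
import math

def two_proportion_z_test(yes_a, n_a, yes_b, n_b):
    """Two-sided z-test for a difference between two Yes-response proportions."""
    p_a, p_b = yes_a / n_a, yes_b / n_b
    # Pooled proportion under the null hypothesis of no difference.
    pooled = (yes_a + yes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided tail probability of a standard normal variable.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical counts: "Would you click on this result" answered Yes by
# 620 of 1000 users in one condition and 560 of 1000 in another.
z, p = two_proportion_z_test(620, 1000, 560, 1000)
significant = p < 0.05   # 95% confidence corresponds to alpha = 0.05
print(f"z = {z:.2f}, p = {p:.4f}, significant = {significant}")
```

With roughly 1000 responses per condition, a gap of a few percentage points is enough to cross the 0.05 threshold. For the questions answered on the 1-to-5 agreement scale, a test suited to ordinal or mean-valued data would be needed instead.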
We observed two interesting phenomena. First, four of the experiments found no differences in the directly related questions. For example, based on our previous field study, we expected that a very high density of query term hits would be perceived as "spammy," and lead to lower scores on questions related to trust. However, this did not occur.
Second, there were three experiments in which users perceived differences in a dimension not directly related to the attribute being manipulated. For example, in the genre experiment, we expected that the presence of genre cues would cause higher ratings for the statement "I know what kind of page to expect if I were to click on the result," and it did. But this condition also received higher ratings for how likely the user is to trust the result, and whether the user believes that the page will have the information s/he is looking for.
By manipulating particular attributes of search result summaries while asking users to indicate how well the results meet certain quality criteria, we have started to understand what users value in summaries.
We found that neither abstract length nor query term density had a significant effect on perceptions of quality when users read summaries, though we expect these factors to play a role when users scan result pages, as eye-tracking research suggests [1]. On the other hand, text choppiness and truncation affected user ratings of several factors: not only readability (as one might expect), but also trust in the results and understanding of why the page was retrieved.
Perhaps our most interesting finding was that providing genre cues in abstracts caused users to have a more favorable impression not only of how well the abstract conveyed the type of page expected but several other attributes as well. This "genre halo effect" increased how much users trusted the result and how likely they felt it would contain the information they were seeking.
1. A snippet is a contiguous piece of text extracted from the document, though some search engines have used this term to refer to the entire abstract.
[1] Eye tracking study: An in depth look at interactions with Google using eye tracking methodology, Technical Report, Enquiro, EyeTools and Did-It, Jun 2005.
[2] Understanding user goals in web search, WWW '04, pages 13-19, New York, NY, USA, 2004.
[3] Advantages of query biased summaries in information retrieval, SIGIR '98, pages 2-10, New York, NY, USA, 1998.