WebCLEF 2007 IQ
Intro
The WebCLEF 2007 IQ task (here IQ stands for Informational Queries) combines insights gained from previous editions of WebCLEF and the WiQA 2006 pilot, and goes beyond the navigational queries considered at WebCLEF 2005 and 2006. At WebCLEF 2007 we consider so-called undirected informational search goals in a web setting: "I want to learn anything/everything about my topic." A query for topic X might be interpreted as "Tell me about X".
Brief history
The two tasks, what was good, what was bad and why we want to merge them.
WebCLEF 2005-2006
WebCLEF 2005 was organized with the purpose of facilitating the evaluation of multilingual retrieval systems. It used the EuroGOV collection: web documents crawled from European governmental sites in the domains at, be, cy, cz, de, dk, ee, es, eu.int, fi, fr, gr, hu, ie, int, it, lt, lu, lv, mt, nl, pl, pt, ru, se, si, sk and uk. The test set consisted of 575 monolingual and bilingual known-item topics (home page finding and named page finding).
Details:
- WebCLEF 2005 overview paper
The WebCLEF 2006 task used the same collection. Besides the topics from 2005, the test set contained 125 new manually created topics and 1,620 known-item topics generated automatically using different techniques. The main findings in 2006 were as follows. First, current CLIR systems are very effective, retrieving on average the target page in the top ranks. Second, manually constructed topics result in higher performance than automatically generated ones. Finally, the resulting scores on automatic topics provide a reasonable ranking of the systems, showing that automatically generated topics are an attractive alternative in situations where manual topics are not readily available.
Overall, WebCLEF 2005 and 2006 left us with a large (and we believe sufficiently large) test set of known-item topics.
Details:
- WebCLEF 2006 overview paper
WiQA 2006
The WiQA 2006 pilot within the CLEF 2006 Question Answering task was organized with the purpose of evaluating focused information access systems on a complex multilingual collection with a rich structure: Wikipedia. In the task, automatic systems were expected to find text snippets that are relevant, interesting and novel for a given topic (a Wikipedia article). WiQA 2006 featured three monolingual tasks (Dutch, English and Spanish) and one bilingual task (Dutch-English).
The main findings were ...
- Participants thought it was a challenging but interesting task
- The overall inter-annotator agreement was far from perfect. For 1428 snippets with double assessments the assessors agreed in 73% of the cases on importance and in 73% of the cases on novelty of the retrieved snippets. The kappa between the 8 assessor pairs ranged between 0.13 and 0.71.
- Retrieval unit length was defined as at most two sentences. Due to errors in automatic sentence splitting in the annotations of the collections, systems often returned unnaturally long snippets (usually long itemized lists).
- Evaluation too complicated (four criteria) ...
Details:
- WiQA: Evaluating Multi-lingual Focused Access to Wikipedia, Valentin Jijkoun and Maarten de Rijke. In: EVIA 2007, 2007.
The WebCLEF 2007 task
We take the following as key starting points for defining a CLEF task:
- The task should correspond as closely as possible to some real-world information need, with a clearly defined user;
- Multi- and cross-linguality should be natural (or even essential) for the task;
- The collection(s) used in the task should be the source of choice for the user's information need;
- Collections, topics and assessors' judgements resulting from the task should be re-usable in the future; and finally,
- The task should be challenging for the state-of-the-art technology.
Task model
Our user is an expert writing a survey article on a specific topic with a clear goal and audience, for example, a Wikipedia article, or a state of the art survey, or an article in a scientific journal. She needs to locate items of information to be included in the article and wants to use an automatic system to help with this. The user does not have immediate access to offline libraries and only uses online sources.
The user formulates her information need (the topic) by specifying:
- Short topic title (e.g., the title of the survey article)
- Free text description of the goals and the intended audience of the article
- A list of languages in which the user is willing to accept the found information
- Optional list of known sources: online resources (URLs of web pages) that the user considers to be relevant to the topic and information from which might already have been included in the article
- Optional list of Google retrieval queries that can be used to locate the relevant information; the queries may use site restrictions (see examples below) to express the user's preferences.
An example of an information need:
- topic title: Significance Testing
- description: I want to write a survey (about 10 screen pages) for undergrad students on statistical significance testing, with an overview of the ideas, common methods and critiques. I will assume some basic knowledge of statistics.
- languages: English
- known sources: http://en.wikipedia.org/wiki/Statistical_hypothesis_testing;http://en.wikipedia.org/wiki/Statistical_significance
- retrieval queries: significance testing;site:mathworld.wolfram.com significance testing;significance testing pdf;significance testing site:en.wikipedia.org
Another example of an information need (this one much more multilingual):
- topic title: electoral search Europe
- description: I need to put together a survey of current electoral search engines in Europe, as part of a research proposal. I am interested both in finding out about what exists in the various countries around Europe and in learning about multilingual electoral search engines.
- languages: English, Dutch
- known sources: http://nl.wikipedia.org/wiki/VerkiezingsKijker;http://nl.wikipedia.org/wiki/Lijst_met_zoekmachines
- retrieval queries: electoral search;site:search.eci.gov.in electoral search;technology site:verkiezingskijker.nl
One more example of an information need:
- topic title: Totoro
- description: I want to write an essay about the Totoro character (from the Ghibli movie called "My Neighbor Totoro"). I am keen to find out biographical facts about the character and to include "fan" reports.
- languages: Dutch, English, German
- known sources: http://en.wikipedia.org/wiki/My_Neighbor_Totoro;http://de.wikipedia.org/wiki/Tonari_no_Totoro
- retrieval queries: Totoro character;Totoro site:nausicaa.net;Totoro site:ghibliworld.com;Totoro site:imdb.com
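The topic fields listed above can be captured in a small data structure. The following is only a sketch: the class name `Topic`, its field names and the helper `split_list` are illustrative assumptions, not an official topic format; the fallback to the topic title when no queries are given follows the collection definition used in the task.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Topic:
    """One information need; field names are illustrative, not an official format."""
    title: str
    description: str
    languages: List[str]
    known_sources: List[str] = field(default_factory=list)      # optional
    retrieval_queries: List[str] = field(default_factory=list)  # optional

    def effective_queries(self) -> List[str]:
        # Fall back to the topic title when no retrieval queries are specified.
        return self.retrieval_queries or [self.title]


def split_list(value: str, sep: str = ";") -> List[str]:
    """Split a separator-delimited field, dropping empty items."""
    return [item.strip() for item in value.split(sep) if item.strip()]


# The first example topic from the text, with a shortened description.
topic = Topic(
    title="Significance Testing",
    description="A survey for undergrad students on statistical significance testing.",
    languages=split_list("English", sep=","),
    known_sources=split_list(
        "http://en.wikipedia.org/wiki/Statistical_hypothesis_testing;"
        "http://en.wikipedia.org/wiki/Statistical_significance"
    ),
    retrieval_queries=split_list(
        "significance testing;site:mathworld.wolfram.com significance testing;"
        "significance testing pdf;significance testing site:en.wikipedia.org"
    ),
)
```

Keeping the two optional lists empty by default mirrors the fact that a topic may consist of only a title, a description and a list of languages.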
The user expects the system to return a list of text snippets with links to their online sources. The snippets should be relevant, important and interesting, as perceived by the user. The snippets can originate from any of the online resources found by the system (excluding the "known sources"). The degree to which the information need is satisfied is measured by the user as the number of distinct atomic facts that the user includes in the article after analyzing the top snippets returned by the system.
Defined in this way, the task model corresponds to addressing undirected informational search goals, which are reported to account for over 23% of web queries (see D. Rose and D. Levinson, Understanding user goals in web search, WWW'04).
Participants will be expected to develop a small number of topics (and to assess the systems' responses). Development and assessment guidelines will be released at a later stage.
Data collection
In order to turn the assumed user model into a description of an IR evaluation exercise, we need to fix the collection: the set of all possible information sources. In order to keep the idealized task as close as possible to the real-world scenario (i.e., there are many relevant topics) but still tractable (i.e., the size of the collection is manageable), our collection is defined per topic. Specifically, for each topic, the subcollection for the topic contains the following set of documents along with their URLs:
- all "known" sources specified for the topic;
- the top 1000 (or fewer, depending on actual availability) hits from Google for each of the retrieval queries specified in the topic, or for the topic title if the queries are not specified.
For each online document included in the collection, its URL, the original content retrieved from the URL and the plain text conversion of the content are provided. The plain text conversion is only available for HTML, PDF, Postscript and Word documents. For each document, the subcollection also provides its origin: which query or queries were used to locate it and at which rank(s) in the Google result list it was found.
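The per-topic subcollection described above could be assembled along the following lines. This is a minimal sketch, not the organizers' tooling: the names `Document`, `build_subcollection` and `hits_by_query` are assumptions, and the ranked result lists are assumed to be supplied by the caller rather than fetched from any search API.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Optional, Tuple

MAX_HITS_PER_QUERY = 1000  # upper bound per query; fewer if not available


@dataclass
class Document:
    url: str
    content: Optional[bytes] = None   # original content retrieved from the URL
    plain_text: Optional[str] = None  # only for HTML, PDF, Postscript and Word
    known_source: bool = False
    # (query, rank) pairs recording how the document was located
    origins: List[Tuple[str, int]] = field(default_factory=list)


def build_subcollection(
    known_sources: Iterable[str],
    hits_by_query: Dict[str, List[str]],
) -> Dict[str, Document]:
    """Merge known sources and ranked hits into one URL-keyed subcollection."""
    docs: Dict[str, Document] = {}
    for url in known_sources:
        docs[url] = Document(url=url, known_source=True)
    for query, urls in hits_by_query.items():
        for rank, url in enumerate(urls[:MAX_HITS_PER_QUERY], start=1):
            doc = docs.setdefault(url, Document(url=url))
            doc.origins.append((query, rank))
    return docs
```

Keying the documents by URL means that a page found by several queries appears once, with every query and rank kept in `origins`.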
System response
For each topic description, a system responds with a list of plain text snippets extracted from the sub-collection of the topic. Each snippet indicates which document of the sub-collection it comes from. The total length of all plain text snippets returned for a given topic should not exceed N characters.
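One way a system might respect the length limit is to walk its ranked snippets and keep those that still fit. The sketch below is an assumption about how a participant could do this, not a prescribed procedure; `Snippet`, `fit_to_budget` and the parameter `max_chars` (standing in for the limit N, whose value is not fixed here) are hypothetical names.

```python
from typing import List, NamedTuple


class Snippet(NamedTuple):
    doc_url: str  # document of the sub-collection the snippet comes from
    text: str     # plain text extracted from that document


def fit_to_budget(ranked: List[Snippet], max_chars: int) -> List[Snippet]:
    """Keep ranked snippets, in order, while the total length stays within max_chars."""
    response: List[Snippet] = []
    used = 0
    for snippet in ranked:
        length = len(snippet.text)
        if used + length > max_chars:
            continue  # this snippet does not fit; a shorter one further down may
        response.append(snippet)
        used += length
    return response
```

Skipping an oversized snippet instead of stopping lets lower-ranked but shorter snippets use the remaining budget; stopping at the first snippet that does not fit would be an equally valid choice.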
Assessment of the responses
In order to comply with the task model, the manual assessment of the responses of the systems will be done by the topic creators. The assessment procedure will be somewhat similar to assessing answers to OTHER questions in the TREC 2006 Question Answering task [ref].
The assessment will be blind. For a given topic, all responses of all systems will be pooled into an anonymized sequence of text segments. Duplicate and overlapping text segments will be automatically merged to reduce the load on the assessors. For a set of responses of one or more systems for a given topic, the assessor will make a list of nuggets: atomic facts that she thinks should be included in the article. A nugget may be expressed by one or more returned snippets in the responses. Each returned snippet may express one or more nuggets. The assessor will use a GUI to mark character spans in the snippets and link each span with the nugget it expresses (if any).
(Example)
Similar to INEX and to some tasks at TREC (e.g., the 2006 Expert Finding task), assessment will be carried out by the topic developers, i.e., by the participants themselves.
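The automatic merging of duplicate and overlapping segments during pooling could be sketched as a per-document interval merge. This rests on an assumption the task description does not state: that each pooled segment is identified by a document URL plus start and end character offsets into that document's plain text. The names `Segment` and `merge_segments` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Segment = Tuple[str, int, int]  # (doc_url, start, end), with end exclusive


def merge_segments(segments: Iterable[Segment]) -> List[Segment]:
    """Merge duplicate, overlapping or touching segments of the same document."""
    by_doc: Dict[str, List[Tuple[int, int]]] = defaultdict(list)
    for doc_url, start, end in segments:
        by_doc[doc_url].append((start, end))

    merged: List[Segment] = []
    for doc_url in sorted(by_doc):
        current_start, current_end = None, None
        for start, end in sorted(by_doc[doc_url]):
            if current_end is not None and start <= current_end:
                # Overlaps (or touches) the segment being built: extend it.
                current_end = max(current_end, end)
            else:
                if current_end is not None:
                    merged.append((doc_url, current_start, current_end))
                current_start, current_end = start, end
        merged.append((doc_url, current_start, current_end))
    return merged
```

Because the merged list no longer records which system returned which segment, it is also a natural place to drop system identities before the segments are shown to the assessor.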
Evaluation measures
The evaluation measures for the task are based on standard precision and recall. We distinguish nugget-based and character-based measures.
- Nugget-based (resp., character-based) Recall: the number of identified nuggets (resp., their character length) that are covered by the snippets of a system S, divided by the total number of nuggets (resp., their total character length).
- Precision: the number of characters that belong to at least one span linked to a nugget, divided by the total character length of the system's response.
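The measures above can be computed directly from the assessor's span annotations. The following is a sketch under stated assumptions: a span is represented as (snippet index, start, end, nugget id), and each nugget is assumed to come with a reference character length (`nugget_lengths`), which the definitions above do not specify further. All function and parameter names are illustrative.

```python
from typing import Dict, List, Set, Tuple

Span = Tuple[int, int, int, str]  # (snippet index, start, end exclusive, nugget id)


def nugget_recall(spans: List[Span], all_nuggets: Set[str]) -> float:
    """Fraction of all identified nuggets covered by the system's snippets."""
    covered = {nugget for _, _, _, nugget in spans} & all_nuggets
    return len(covered) / len(all_nuggets) if all_nuggets else 0.0


def character_recall(spans: List[Span], nugget_lengths: Dict[str, int]) -> float:
    """Character length of covered nuggets over the total nugget length."""
    covered = {nugget for _, _, _, nugget in spans} & set(nugget_lengths)
    total = sum(nugget_lengths.values())
    return sum(nugget_lengths[n] for n in covered) / total if total else 0.0


def precision(snippets: List[str], spans: List[Span]) -> float:
    """Characters inside at least one nugget-linked span over the response length."""
    marked: Set[Tuple[int, int]] = set()  # (snippet index, character position)
    for index, start, end, _ in spans:
        end = min(end, len(snippets[index]))
        marked.update((index, pos) for pos in range(max(start, 0), end))
    total = sum(len(text) for text in snippets)
    return len(marked) / total if total else 0.0


# Toy response: two snippets (100 characters) and three marked spans.
snippets = ["a" * 40, "b" * 60]
spans = [(0, 0, 20, "n1"), (0, 10, 30, "n2"), (1, 0, 20, "n1")]
print(precision(snippets, spans))                        # 0.5
print(nugget_recall(spans, {"n1", "n2", "n3", "n4"}))    # 0.5
```

Collecting marked positions in a set ensures that a character covered by several overlapping spans is counted once, as the precision definition requires.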
WebCLEF 2007 IQ in questions (and answers)
- Q: Why does the WebCLEF 2007 IQ task diverge so much from WebCLEF 2006 and Web IR in general?
- A: At WebCLEF we have developed plenty of known-item topics, so we need to move on to other types of information needs. That explains the difference with WebCLEF 2005 and 2006. As to research on Web IR (in the TREC, CLEF, NTCIR setting), there is attention to issues of scale (addressed at the Terabyte track and the Million Queries track) and increasingly there is attention to different types of queries. And that's the background against which you should see the WebCLEF 2007 IQ task: informational queries are an important set of web queries (see the Rose and Levinson study, and the early Broder paper from SIGIR Forum). The challenge for 2007 was to come up with a natural and do-able task centered around informational queries.
- Q: How is WebCLEF 2007 IQ different from the so-called OTHER questions at the TREC QA track?
- A: The evaluation methodology of WebCLEF 2007 IQ does indeed use ideas from the evaluation of "OTHER" questions at the TREC QA track. Unlike TREC QA, WebCLEF 2007 IQ uses web data and defines a clear task model: the user is not simply collecting "interesting facts" about a topic, but is gathering material with a clear purpose: writing a survey article. We believe that a successful WebCLEF 2007 IQ system can be a useful tool in everyday web search.
- Q: WebCLEF 2007 IQ looks more similar to WiQA 2006 than to WebCLEF 2006. Why this name, then?
- A: The WiQA pilot made a valuable contribution to the name of the task as well: it provided the "IQ" part.
- Q: To me, working with web data means working with lots of data and working with link structure. I don't really see that happening at WebCLEF 2007 IQ.
- A: True. We think that if you are into lots of data and issues of scale, you should wait for the next edition of the Terabyte track (in 2008 or 2009). In this edition of WebCLEF, we want to be fairly close to a realistic task that involves web data, and we view existing commercial web search engines as a black box that we can use as a building block. This still leaves plenty of web-specific features for participants to deal with (document structure, multiple languages, diversity, authoritativeness, noise, font encodings, etc.). WebCLEF 2007 IQ provides researchers interested in working with web data with an opportunity to focus their system building efforts on addressing these issues rather than on dealing with issues of scale.