Entity workshop summary from TREC2008
This is a brief summary of the Entity track workshop from the TREC 2008 conference.
The overall aim of this new track is to perform entity-related search on Web data. Users often search for specific entities rather than for documents of any kind. These search tasks address common information needs that are not well modeled as ad hoc document search.
Motivation, goals
The track was motivated to a large degree by the expert finding task at the TREC Enterprise track (which came to an end in 2008). The Enterprise track provided a platform for looking at one specific type of entity (experts) from two directions:
- expert finding: identifying a list of people who are knowledgeable about a given topic (“Who are the experts on topic X?”), and
- expert profiling: returning a list of topics a person is knowledgeable about (“What topics does person Y know about?”).
These tasks serve as our starting point. Translating them to the Entity track, we aim at (1) finding entities in the collection (retrieving entities in a particular context) and (2) gaining insights about entities (retrieving the context for a given entity).
We are aware that several related problems have been or are being looked at in various evaluation forums, such as TREC Enterprise, TREC QA, INEX Entity Ranking, TAC summarization, SemEval Web People Search, etc. We need to make sure that the Entity track differentiates itself from these related efforts.
What is an entity?
According to our working definition, an entity is “something that has a homepage”. The URL of the homepage serves as a unique identifier for the entity (“entityID”). In the first year of the track we are focusing on three types of entities: people, organizations, and products.
Discussion at the workshop made it apparent that we need a much more precise definition of what an entity’s homepage is.
- An entity might have multiple homepages; a person, for example, often has more than one. Which is the preferred one?
- What is the “entry point”? Taking the iPhone as an example: is the product’s homepage the entry page or the features page?
- Staying with the iPhone example, does its Wikipedia page count as a homepage of the entity? If yes, systems could possibly “cheat” by always returning the Wikipedia page (if we are talking about an entity retrieval type of task).
- Also, if the collection is big enough (and it is supposed to be) the entity’s homepage might exist in multiple languages.
Tasks
During the workshop, four candidate tasks were discussed:
- Entity ranking: given a topic, find relevant entities (“computer scientists in the Netherlands who work on information retrieval”).
- Entity distillation: given the name and homepage of an entity as input, return documents that contain key information about the entity (“I’m a marketing/PR person and want to monitor what the key topics are that people discuss in relation to iPhone”).
- Entity relation finding: given the name and homepage of an entity, as well as the type of the target entity, find related entities that are of the target type (“organizations that Tom Cruise is related to”).
- Attribute identification: given a list of entities (with names and homepages), return a list of key aspects. For example, if the input entities are sports cars, the list of attributes to be returned should include manufacturer, top speed, acceleration, etc.
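To make the input and output of the entity relation finding task more concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions, not a track specification: the class names (`Entity`, `EntityType`, `RelationQuery`) and the helper `filter_by_target_type` are hypothetical, and the sketch covers only the type-filtering and deduplication step, not the retrieval of related entities itself. It relies on the working definition above, where the homepage URL acts as the entityID.

```python
from dataclasses import dataclass
from enum import Enum


class EntityType(Enum):
    """The three entity types considered in the first year of the track."""
    PERSON = "person"
    ORGANIZATION = "organization"
    PRODUCT = "product"


@dataclass(frozen=True)
class Entity:
    name: str
    homepage: str  # working definition: the homepage URL is the entityID
    type: EntityType


@dataclass(frozen=True)
class RelationQuery:
    """Input of entity relation finding: a source entity plus a target type."""
    source: Entity
    target_type: EntityType


def filter_by_target_type(query, candidates):
    """Keep candidates of the requested type, deduplicated by homepage URL.

    The source entity itself is never returned. Input order is preserved.
    """
    seen = set()
    results = []
    for entity in candidates:
        if entity.type is not query.target_type:
            continue
        if entity.homepage == query.source.homepage:
            continue
        if entity.homepage in seen:
            continue
        seen.add(entity.homepage)
        results.append(entity)
    return results
```

Identifying entities by homepage URL keeps deduplication trivial, but it inherits every open question about what counts as an entity's homepage.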
Out of these tasks, only one is going to run (as a pilot task) in 2009.
After an initial poll, entity relation finding and attribute identification were identified as being of the most interest to participants. Further details of these tasks are up for discussion on this blog.
Entity ranking was eliminated because, under the above definition of an entity, it sounds like a simple web document search task. As for entity distillation, it seems very much like a QA task.
Data and topics
The track will be using the new web collection (aimed at 1 billion documents). It is very likely that only a subset of this collection will be used in the first year.
Topic development and relevance assessments will be performed within the community (and topic authors should make the judgments for their own topics).
Tags: ideas, summary, TREC2008, workshop