Training topics discussion
The purpose of this page is to discuss issues related to training topics.
Issues identified so far (list will be updated to reflect the discussion):
- historical entities (no homepage at all); we propose to ignore these
- most entities have Wikipedia pages; this in itself is not a problem, as Wikipedia pages are good but not primary answers. Still, the task is much more challenging if the entities to be found do not have a Wikipedia page
- official vs. non-official pages (esp. for people and products); fan pages are often of much better quality than outdated “official” pages. Should the “most informative” page be treated as the primary homepage of the entity?
- region-specific pages (esp. product pages); we need to assign an “area code” to our user. The most convenient choice is probably a user from the US
- many entities drawn from the same source, i.e., the query is easily answerable from a single document (often a Wikipedia page); topic creators will avoid this as much as possible. However, selecting the proper entities from a sufficiently long document is quite challenging on its own.
- entity names (esp. for people and products), e.g., “Mac OS/X” vs. “Mac OSX”; strings need to be normalized before comparisons (where “normalized” is yet to be defined)
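To make the last point concrete, here is a minimal sketch of one possible normalization. Since “normalized” is not yet defined, everything here is an assumption for illustration: the function names normalize_entity_name and same_entity_name are hypothetical, and the rule (Unicode NFKC folding, case folding, then dropping everything that is not a letter or digit) is just one candidate, not an agreed policy.

```python
import unicodedata


def normalize_entity_name(name: str) -> str:
    """Reduce an entity name to a comparison key (one assumed rule, not a decided one)."""
    # Fold compatibility characters (e.g. full-width letters) into their plain forms.
    folded = unicodedata.normalize("NFKC", name)
    # Make the comparison case-insensitive.
    folded = folded.casefold()
    # Keep letters and digits only, dropping spaces, slashes, and other punctuation.
    return "".join(ch for ch in folded if ch.isalnum())


def same_entity_name(a: str, b: str) -> bool:
    """Compare two entity names after normalization."""
    return normalize_entity_name(a) == normalize_entity_name(b)


print(same_entity_name("Mac OS/X", "Mac OSX"))  # True
```

A rule this aggressive will also merge names that differ only in punctuation or spacing, so it may be too coarse for some people and product names; a softer rule (e.g., collapsing whitespace only) would be the obvious alternative to discuss.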
3 Responses to “Training topics discussion”
June 11th, 2009 at 11:35 am
New Issue:
A number of sites for the example entities (e.g., Gillette, Nokia) are very “flashy” but have almost no textual content.
* The retrieval of relevant documents for this type of site will (probably) be more difficult than for sites with more content.
* In relation finding: if a flashy site is the target entity it will also be harder to find.
So for this kind of site, the emphasis shifts from traditional content-based retrieval to retrieval based on structure, URL features, and PageRank.
It seems that it would be difficult for a single system to perform well on both flashy and content-based sites.
Question: do we want this extra difference in topic difficulty and if so will the distinction be taken into account in evaluation?
What will the assessors’ policy be in these cases?
June 18th, 2009 at 2:18 pm
Let’s restrict ourselves to textual content. That means we need to make sure that Flash content is not displayed for assessors.
July 11th, 2009 at 8:32 am
I wonder when the topics will be released.
Thanks