WebCLEF 2005 workshop report

Vienna, September 23, 2005
Report by Diana Santos and Maarten de Rijke
Issues related to WebCLEF 2005

The first question sought feedback on why people had or had not participated in WebCLEF. Paul McNamee explained that he was not interested in a homepage finding task; had the queries been of a more informational kind (rather than homepage finding), he would have participated. Jacques Savoy, in addition to time constraints, mentioned that it was not clear what was expected in terms of relevance assessment (i.e., a lack of sufficient information). Dominic Laurent also mentioned too little time.

The second question discussed was character encoding, which was a headache for both organizers and participants; Maarten de Rijke asked for help and suggestions on how to deal with it. Donna Harman suggested checking with the LDC, which would probably have some tools for this. Someone asked what exactly the problem was: character encoding varies widely on the (at least European) Web, and the encoding information given in the HTTP header metadata is very often wrong, because it is produced automatically and people do not know how, or do not care, to set it right. Diana Santos suggested that, now that the EuroGOV collection was available, studies of which tools had been used and their relationship with character encoding were in order, and that in later campaigns the organization should deal with this by making sure that the character encoding was right.
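The encoding problem described above can be illustrated with a minimal sketch. It is not the procedure used by the WebCLEF organizers; the function name guess_encoding and the fallback order are assumptions made for illustration. The idea is to distrust the declared charset: try strict UTF-8 first (non-ASCII text in a legacy encoding rarely happens to be valid UTF-8), then the declared charset, and only then fall back to ISO-8859-1, which decodes any byte sequence.

```python
from typing import Optional


def guess_encoding(raw: bytes, declared: Optional[str] = None) -> str:
    """Pick a plausible encoding for a page, distrusting the declared charset.

    Hypothetical helper: the name and the fallback order are illustrative
    assumptions, not a tool from the WebCLEF campaign.
    """
    # Step 1: strict UTF-8. It is self-validating, so success is strong evidence.
    try:
        raw.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        pass

    # Step 2: the charset declared in the HTTP header, if it exists and works.
    if declared:
        try:
            raw.decode(declared)
            return declared.lower()
        except (LookupError, UnicodeDecodeError):
            # Unknown charset name, or bytes that are invalid in that charset.
            pass

    # Step 3: last resort. ISO-8859-1 maps every byte to a character,
    # so decoding never fails, but the result may still be wrong.
    return "iso-8859-1"
```

For example, a page whose bytes are Latin-1 but whose header claims UTF-8 ends up as "iso-8859-1", while a UTF-8 page mislabelled as ISO-8859-1 is still recognised as "utf-8". A real campaign would probably want statistical detection on top of this, since step 3 cannot tell the many single-byte encodings apart.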

The third topic discussed was topic development. Maarten de Rijke said they had spent 2 days on topic development and asked whether others had used the same (total) amount of time. A Spanish group (Jaén? Alicante?) mentioned that they had distributed the topic development work, and had produced 4 or 5 topics an hour. Thomas Mandl suggested that the list presented by the topic creation tools was too short (20 hits) and should be extended in coming years.

The fourth subject was why there was such a big difference between the multilingual and mixed monolingual tasks. Maarten explained that, being organizers and participants at the same time, Univ. Amsterdam had only produced a non-Web-specific run, as a sort of baseline. Thomas Mandl could not answer the question, because for their system the multilingual run was actually comparable to (or even better than) the monolingual runs. Stephen Tomlinson could not answer it either, because they had only participated in the mixed monolingual task. One suggestion raised by Maarten de Rijke was to analyse the translation engines used, and possibly to provide translations.

Methodological issues

The question of the multilingual user model was raised: which users would search crosslingually in European government pages? Donna Harman explained that for search in GOV pages at TREC, the typical users were retired people who wanted to know about social security benefits, medical organizations, etc. Diana Santos hypothesized that for monolingual Gov queries, information on how to pay taxes was the most typical need, and therefore that topics would be transactional rather than informational (in Broder's sense). Maarten suggested that in a crosslingual setting, how immigrants can obtain residence permits would be a typical user need, but noted that log studies (and the logs themselves) are still missing. An example of an informational topic would be gathering information about the Dutch prime minister. Donna Harman asked whether it would be possible to compare legislation across countries, for example on genetically modified seeds. This led to the suggestion of focusing on specific cross-border domains. Other ideas were travel, starting up a company, and civil liberties and rights. It was remarked that the WebCLEF collection was very much a set of 25 islands with very few bridges among them. Still, Finland, Sweden and Holland had some pointers to other governments. A study of the link structure in the collection is still to be done. Thomas Mandl mentioned that the EU pages were also included in the WebCLEF collection and that they were highly interlinked across languages.
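As a rough illustration of the kind of link structure study mentioned above, the following sketch counts links that cross from one country's domain to another. It is only a sketch under stated assumptions: the input is assumed to be a list of (source URL, target URL) pairs already extracted from the collection, the top-level domain is used as a crude proxy for the country, and the function names and example hosts are hypothetical.

```python
from collections import Counter
from typing import Iterable, Tuple
from urllib.parse import urlsplit


def country_of(url: str) -> str:
    """Return the top-level domain of a URL as a crude country proxy."""
    host = urlsplit(url).hostname or ""
    return host.rsplit(".", 1)[-1].lower()


def cross_country_links(links: Iterable[Tuple[str, str]]) -> Counter:
    """Count links whose source and target lie in different top-level domains."""
    counts: Counter = Counter()
    for source, target in links:
        src, dst = country_of(source), country_of(target)
        # Skip malformed URLs and links that stay within one "island".
        if src and dst and src != dst:
            counts[(src, dst)] += 1
    return counts


# Hypothetical example data, not taken from the collection.
example_links = [
    ("http://ministry.example.se/a", "http://agency.example.fi/"),
    ("http://ministry.example.se/a", "http://ministry.example.se/b"),
    ("http://portal.example.nl/", "http://agency.example.fi/info"),
]
bridges = cross_country_links(example_links)
```

Summing the resulting counts per country pair would give a first impression of how many "bridges" exist between the islands. Supranational sites such as the EU pages would need separate handling, since their domain does not identify a country.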

The second topic was the incompleteness of the "relevant in translation" judgements. There was no way to know whether there were duplicates or near duplicates in the collection, and since the relevance assessments were done beforehand (at the time of topic creation), there was no way to know whether systems had been penalised for finding a correct page that was not in the solution. Thomas Mandl stated that the only way to solve this would be to move towards an ad hoc solution (i.e., create pools and assess them as in the ad hoc track).
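To make the near-duplicate issue concrete, here is a minimal sketch of one common technique, word shingling with Jaccard similarity. It was not discussed at the workshop; the function names, the shingle size k=3 and any threshold applied to the score are illustrative assumptions. It only detects near duplicates within one language; pages that are translations of each other would need a different approach.

```python
from typing import Set, Tuple


def shingles(text: str, k: int = 3) -> Set[Tuple[str, ...]]:
    """Return the set of k-word shingles (overlapping word windows) of a text."""
    words = text.lower().split()
    if len(words) < k:
        # Very short texts are treated as a single shingle (or none if empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: str, b: str, k: int = 3) -> float:
    """Jaccard similarity of the shingle sets of two texts, between 0 and 1."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 1.0  # two empty texts are considered identical
    return len(sa & sb) / len(sa | sb)


page_a = "the ministry of finance publishes the annual budget report"
page_b = "the ministry of finance publishes the annual budget summary"
similarity = jaccard(page_a, page_b)  # 6 shared shingles out of 8 distinct: 0.75
```

Pages whose similarity to a known correct answer exceeds some chosen threshold could then be offered to assessors as candidate duplicates, rather than being counted as wrong automatically.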

The third subject mentioned was the crawl, which was admittedly not perfect, and whether the participants wanted a new crawl or the old one. Since opinion was split on the matter, Maarten de Rijke said he was willing to provide patches for the Portuguese and Spanish parts (the weakest in the crawl), but that, to satisfy the other participants, he would keep the rest of the collection as it was, although new patches could be added. Related to this was the question of PDFs (and other non-HTML files), which were excluded this year. Luís Sarmento claimed that it was relatively simple to extract text from PDF and that PDF content should not be excluded, because the best quality information was provided in PDF. Someone asked whether this was relevant to a crosslingual task on government pages, and Maarten gave the example of "detailed instructions on how to answer a call", typically found in PDF. Another issue to discuss was whether to aim for more data, better coverage, more hits, more size...

Then the important issue of what to do about newcomers was discussed: should they get the CLEF 2005 topics although they had not participated in building them? Carol Peters said that in her experience being generous always paid off, and in general everyone agreed to give CLEF newcomers the same data as the "oldcomers". The latter, as Diana Santos noted, already had experience and so would always be better off anyway. Also, as Maarten de Rijke emphasized, new WebCLEF campaigns would always require the cooperative creation of more topics, shared among all participants.

WebCLEF 2006

Maarten de Rijke wanted to know whether there were new suggestions for pilot (or main) tracks. He himself suggested blogs, i.e. a new type of content, claiming that, although new companies offer blog facilities, this is probably the least interesting thing that can be done with blogs. The idea would be to have tasks in WebCLEF to classify blogs, detect trends, tag them automatically, etc. He showed (live) that some blog services already attach moods to blogs in the form of emoticons (LiveJournal), and he showed a demo that correlated mood with time (Moodgrapher). A three-person committee was set up to deal with this subject: Luís Sarmento, César de Pablo-Sánchez and Anselmo Peñas. Maarten also suggested that marketeers and politicians could be interested in blog analysis.

Suggestions for future WebCLEF editions, made earlier in the discussion, were:

  • (MdR) The idea for WebCLEF was to have a track across languages and across blogs.
  • (Diana Santos) Crosslingual address finding: a typical IE task, for email or postal addresses, which are generally difficult to find on the Web.
  • (Donna Harman) Try subdomain IR, specifically in crosslingually relevant domains.
  • (Proposer not remembered) Translated page finding task.
  • (Maarten de Rijke) A classification task.
  • (Thomas Mandl) Bring it closer to an ad hoc task.

Finally, the session ended with a request for help from Maarten de Rijke: he wanted help to analyse the results, find duplicates in other languages, do link analysis, and study how people use government sites.

Comments/corrections to Maarten de Rijke