ILPS Seminar 2008-2

Unless stated otherwise, all seminar meetings take place in
Room F.009, Informatics Institute, Kruislaan 403, Amsterdam.

  • September 5, 2008; 13:30-14:30
    Manos Tsagkias

    Using Term Clouds to Represent Segment-Level Semantic Content of Podcasts

    Abstract

    Spoken audio, like any time-continuous medium, is notoriously difficult to browse or skim without support of an interface providing semantically annotated jump points to signal the user where to listen in. Creation of time-aligned metadata by human annotators is prohibitively expensive, motivating the investigation of representations of segment-level semantic content based on transcripts generated by automatic speech recognition (ASR).

    This paper examines the feasibility of using term clouds to provide users with a structured representation of the semantic content of podcast episodes. Podcast episodes are visualized as a series of sub-episode segments, each represented by a term cloud derived from a transcript generated by automatic speech recognition (ASR).

    The quality of segment-level term clouds is measured quantitatively, and their utility is investigated in a small-scale user study based on human-labeled segment boundaries. Since the segment-level clouds generated from ASR transcripts prove useful, we examine an adaptation of text tiling techniques to speech in order to generate segments as part of a completely automated indexing and structuring system for browsing spoken audio. Results demonstrate that the generated segments are comparable with human-selected segment boundaries.

  • September 19, 2008; 13:30-14:30
    Anna Ritchie
    Citation Context Analysis for Information Retrieval

    Abstract

    I will present my thesis work, which investigates taking words from around citations to scientific papers in order to create an enhanced document representation for improved information retrieval. This method parallels how anchor text is commonly used in Web retrieval. The contributions of the thesis are twofold: firstly, a novel document representation is presented, along with experiments to measure its effect on retrieval effectiveness, and, secondly, it documents the construction of a new, realistic test collection of scientific research papers, with references (in the bibliography) and their associated citations (in the running text of the paper) automatically annotated. My experiments show that the citation-enhanced document representation increases retrieval effectiveness across a range of standard retrieval models and evaluation measures; the test collection leaves the door open for more extensive experimentation.

  • September 26, 2008; 13:30-14:30
    Bas Zoetekouw, Pedro Fonseca and Merlijn Sevenster (Philips)
    Experience Processing
    Abstract

    The Experience Processing group of Philips Research develops applications to capture, understand, and improve user experiences. This is a new direction for our group; we used to be called Storage Systems and Applications and were mainly involved in video processing and content management.
    For our new research direction, we are now shifting our interests to the area of merging and interpreting information from different modalities, such as audio, video, (internet) text sources, biosensor signals, etc.

    In this talk, we will give a short overview of Philips Research and the Experience Processing group. We will show some applications that we have developed in the past in the areas of sports summarization, navigation of news broadcasts, and summarization of photo collections, as well as some new topics, such as the detection of moods in movies. We will also go into more depth on the MediSearch project.

    In the MediSearch project, we are developing an application to optimize the workflow of radiologists by combining image information (CT and MRI scans) and textual information (radiology reports, encyclopedia, etc.). This poses major challenges, both in image analysis and 3D reconstruction of the patient’s body and in natural language processing.

  • October 3, 2008; 13:30-14:30
    Maarten Clements
    Personalised search in social content systems
    Abstract

    Social content systems contain enormous collections of unstructured user-generated content, annotated by the collaborative effort of regular Internet users. Tag clouds have become popular interfaces that allow users to query the database by clicking relevant terms. However, these single-click queries are often not expressive enough to effectively retrieve the desired content.

    Using both rating and tagging information, we have created a personalized retrieval model, which effectively integrates the personal user preference in the content ranking. We use a random walk model to exploit latent relations between query terms.

    Using collaborative annotations from a popular online book catalog, I will show that this model outperforms standard tag-based retrieval. I will discuss the implications for different types of tagging systems and the robustness of the model to well-known linguistic problems such as synonyms and homographs.

  • October 17, 2008; 13:30-14:30
    Frank Nack
    From automatic creativity to nomadic communication
    Abstract

    In this presentation I will outline the development of my research from automatic video-slapstick generation to information generation and maintenance in global/mobile environments. The aim is to describe why and how the human found its way back into AI media research. Based on thoughts on past developments I will use an essential part of this presentation to describe the two research directions I will follow for the years to come – namely ‘Nomadic Information’ and ‘The Researcher’s Workbench’.

  • October 31, 2008; 13:30-14:30
    Willem Robert van Hage
    Evaluation of the OAEI 2006 & 2007 food thesaurus alignment task
    Abstract

    Since 2004 the Ontology Alignment Evaluation Initiative has organized challenges for evaluating ontology matching technologies.
    These alignment challenges range from matching web-directory structures to complex description-logic ontologies in the medical domain.
    This talk will be about the evaluation technique used for the food thesaurus-alignment task of 2006 and 2007 and the results obtained by the participants.

  • November 7, 2008; 13:30-14:30
    Loredana Afanasiev
    Surfacing the Deep Web
    Abstract

    The Deep Web refers to content hidden behind HTML forms. In order to get to such content, a user has to perform a form submission with valid input values. The name Deep Web arises from the fact that such content was thought to be beyond the reach of search engines. The Deep Web is also believed to be the biggest source of structured data on the Web, and hence accessing its contents has been a long-standing challenge in the data management community.

    There are two main, very different approaches to exposing Deep-Web content: the virtual integration approach, which has often been pursued in the data management literature, and the surfacing approach. In this talk I will describe a system for surfacing Deep-Web content, i.e., pre-computing submissions for each HTML form and adding the resulting HTML pages to a search engine index. This system, called Scuba, was built by a small team led by Jayant Madhavan and Alon Halevy, and it exposes large volumes of Deep-Web content to Google.com users. This content contributes to more than 1000 search queries per second and spans over 50 languages and hundreds of domains.

    I will talk about how the surfacing approach compares to the virtual integration approach; I will present the main challenges of surfacing the Deep Web and the solutions adopted by Scuba; and finally I will talk about my modest contribution to the system.

    References:
    [1] Jayant Madhavan, David Ko, Lucja Kot, Vignesh Ganapathy, Alex Rasmussen, Alon Halevy. Surfacing the Deep Web. VLDB 2008.
    [2] Jayant Madhavan, Loredana Afanasiev, Lyublena Antova, Alon Halevy. Harnessing the Deep Web: Present and Future. Under submission.

  • November 14, 2008; 13:30-14:30
    Joris van Zundert/Karina van Dalen
    eResearch at Huygens Institute – Stylometrics
    Abstract

    In 2004 the department of eResearch was added to the Huygens Institute (an institute of the Royal Netherlands Academy of Arts and Sciences). The goal of the new department is to further the methodology and techniques of research into Dutch literary and science history. This presentation will give an overview of how we have been trying to reach these objectives. An overview of past and current projects will be followed by a more in-depth account of three of our studies involving quantitative stylometrics. This will also show how our approach to (for example) authorship attribution is, as some would suggest, radically different from the usual methodology in Dutch literary research.

  • November 28, 2008; 13:30-14:30
    Dolf Trieschnigg
    MeSH up: Effective Text Classification for Improved Document Retrieval
    Abstract

    The Medical Subject Headings (MeSH) thesaurus has been used for quite some time to manually classify biomedical publications in the MEDLINE database, to enable searching citations. Despite the more recent embrace of full-text search, these MeSH concepts still provide a useful mechanism for searching and organizing biomedical information. An automatic MeSH classification method seems an attractive replacement for laborious manual work, but can it live up to it? Moreover, can such an automatic method be effectively employed to improve IR?

    In this presentation several automatic MeSH classification methods are evaluated to determine their capability of reproducing manual MeSH classifications. Additionally, we evaluate whether these automatic approaches can be effectively used to classify an information need expressed in free text in terms of MeSH concepts. We show that there is a clear relation between a system’s performance in reproducing manual classification and possible gains in information retrieval.

  • December 12, 2008; 13:30-14:30
    open slot

  • December 26, 2008; 13:30-14:30
    open slot

  • January 16, 2009; 13:30-14:30
    Pablo Cesar
    Towards Next-Generation Video Sharing Systems
    Abstract

    Media consumption is an inherently social activity, serving to communicate ideas and emotions across both small- and large-scale communities. The migration of the media experience to personal computers retains social viewing, but typically only via a non-social, strictly personal interface. As a discussion starter we survey previous efforts such as social interactive television, WebTV interfaces, and online video sharing systems. The purpose of this interactive discussion is to identify key factors for next-generation video sharing systems: content modeling, the implications and benefits of social networks, and the importance of the contextual settings of the viewer to distribute and control video material. Finally, we present an architecture and implementation for media content selection, content (re)organization, and content sharing within a user community that is heterogeneous in terms of both participants and devices. The system also allows the user to enrich the content as a differentiated personalization activity targeted to his/her peer group. The final goal of the session is to identify future directions for improving shared experiences around media consumption.

  • January 23, 2009; 13:30-14:30
    open slot