This page contains information about some of the software and resources produced and released by members of the Information and Language Processing Systems group.

  • A Grid-Enabled Lucene
    We implemented Grid-specific classes for Lucene, to let Lucene interact with files on a Grid (both for indexing and retrieval), as described in Deploying Lucene on the Grid by Edgar Meij and Maarten de Rijke.
  • A Language Modeling Extension for Lucene
    We have extended Lucene to use language modeling for scoring documents.
  • A Toy Corpus of SPAM in Blog Comments
    A small collection of 50 blog pages, with 1024 comments; manual classifications of these comments as spam or non-spam (67% are spam). For questions, contact Gilad Mishne.
  • Answer Type Classification
    This page provides the guidelines for answer type classification as well as a text file with 1371 classified questions.
  • Arabic Blogs
    A set of Arabic weblog data, collected by Woiyl Hammoumi.
  • Arabizi transliteration
    Arabizi-to-Arabic transliteration software and Arabizi-English bitext used for OpenMT 2015, and presented at the COLING 2016 Workshop on Noisy User-generated Text (WNUT2016).
  • Blog Wishlists
    A small collection of 91 blogs, with the wishlists of their authors.
  • CoMeRDA
    CoMeRDA is an aggregated search system that provides a web-based user interface with multiple options for visualizing search results. Users are able to search, filter and bookmark results from different document collections.
  • Compound word splitter for Dutch
    Compound splitter splits compound words into parts (e.g., hypotheekrenteaftrek into hypotheek, rente and aftrek). It can be used as a Perl module or a standalone network server. Only the Dutch language is currently supported.
  • Concept Selection Benchmarks for Video Retrieval
    This page presents two benchmarks designed for assessing automatic concept selection algorithms in concept-based video retrieval.
  • Dataset for Clicks, Attention and Satisfaction Model and Metric
  • Dataset for Contrastive Theme Summarization
    The collection consists of articles from the New York Times, used for the SIGIR 2015 paper "Summarizing Contrastive Themes via Hierarchical Non-Parametric Processes".
  • DBpedia annotations and features (ISWC 2009, JWS 2010)
    This page contains the data files that were used in the ISWC 2009 paper "Learning Semantic Query Suggestions", by E. Meij, M. Bron, B. Huurnink, L. Hollink, and M. de Rijke.
  • Dynamic Query Modeling for Related Content Finding
    Dataset that was created for the SIGIR 2015 paper Dynamic Query Modeling for Related Content Finding on subtitles by Daan Odijk, Edgar Meij, Isaac Sijaranamual and Maarten de Rijke. In the paper, we present a query modeling approach to find related content in a live TV news setting.
  • Early detection of topical expertise
    This dataset contains the features extracted from the Stack Overflow dataset, used in our SIGIR 2015 paper on early detection of topical expertise in community question answering.
  • embedding-utils
    Python library with functions to read and write word2vec vectors.
  • Expert Profiling in a Knowledge-intensive Organization
    The TU Webwijs expert profiling and expert finding test collection. It is an updated release of the UvT collection. It has two sets of relevance assessments: the original profiles as selected by the experts in TU Webwijs, and graded relevance assessments collected in a self-assessment experiment described in the paper "Expert Profiling in a Knowledge-intensive Organization".
  • Gen&Topic data
    Evaluation set created for the analysis of the impact of genre and topic differences on SMT quality, as published in ACL 2015.
  • Ground truth set for monitoring shifts in vocabulary over time
    Ground truth data that goes with "Ad Hoc Monitoring of Vocabulary Shifts over Time", Tom Kenter, Melvin Wevers, Pim Huijnen, Maarten de Rijke, CIKM 2015.
  • Historic Document Retrieval Resources: 17th century Dutch
    To support historic document retrieval in general, and retrieval in 17th century Dutch texts in particular, we are making available a document collection, together with topics and qrels for those topics.
  • Hypergeometric Language Models for Republished Article Finding
    The ground truth.
  • Information Retrieval Resources for Bahasa Indonesia
    To support document retrieval in Bahasa Indonesia, we are making available a Porter stemmer for the language, a stop word list, as well as two document collections, together with topics and qrels for those topics.
  • KIEM project: Self Organising Archives
    We make available the annotations developed within the KIEM project Self Organising Archives.
  • Lerot: Online Learning Framework
    This project is designed to run experiments on online learning to rank methods for information retrieval.
  • Linking Online News and Social Media
    This is the ground truth for linking online news stories and social media.
  • Living Labs for IR Evaluation API
    Living labs is a new evaluation paradigm for information retrieval.
  • Merdes
    The source code to set up a subjunctive exploratory search interface as used in the experiments for the SIGIR2012 paper: A Subjunctive Exploratory Search Interface to Support Media Studies Researchers.
  • Model checking for XML query evaluation
    XMChecker (XML Model Checker) implements Simple XPath query evaluation via CTL.
  • Multilingual movie dialogues
    Annotated multilingual movie dialogues used for analysis of the impact of conversational aspects on SMT quality, as published in COLING 2016.
  • pyndri
    pyndri is a Python interface to the Indri search engine.
  • Query-dependent Contextualization of Streaming Data
    This dataset of tweets and contextual annotations was used for the paper: N. Voskarides, D. Odijk, E. Tsagkias, W. Weerkamp, and M. de Rijke. Query-dependent Contextualization of Streaming Data. In: 36th European Conference on Information Retrieval (ECIR’14).
  • Ranking Related Entities: Components and Analyses
    A number of resources used in the paper Ranking Related Entities: Components and Analyses.
  • Reputation Polarity Resources
    Resources for RepLab2012
  • Resources for example based entity search in the web of data
    Resources for example based entity search in the web of data
  • Result Disambiguation in Web People Search: Data and ground truth
    This page contains data and ground truth used in the ECIR 2012 paper on Result Disambiguation in Web People Search.
  • Social networking spam
    Spam detection in social networking sites is a challenging problem, mainly due to the volume of messages and the pace with which their contents change. In this paper we propose a method for spam detection based on the HITS algorithm. To evaluate our models we use a dataset from the largest Dutch social networking site (Hyves).
  • Source Code Retrieval
    These are the datasets used for the experiments with conceptual retrieval of source code.
  • Ssscrape: a system for collecting dynamic web data
    Ssscrape stands for Syndicated and Semi-Structured Content Retrieval and Processing Environment. Ssscrape is a framework for crawling and processing dynamic web data, such as RSS/Atom feeds.
  • Stemmer and Stopping in Hungarian
    We provide two stemmers for Hungarian: a light stemmer and a heavy stemmer.
  • Timex Annotation System
    TimexTag is a modular system for recognition and interpretation of temporal expressions in English text.
  • TU expert collection
    The TU expert collection is based on the Webwijs (“Webwise”) system developed at Tilburg University (TU) in the Netherlands. It is an update of the UvT expert collection.
  • Twitter language identification
    Language identification on Twitter data is a challenging task. In this paper, we train TextCat on a set of English, German, French, Dutch, and Spanish tweets and show that retraining helps a lot, achieving up to 95% accuracy on English, compared to 88% using a model trained on non-Twitter data.
  • Twitter to concept annotations for the WSDM paper “Adding Semantics to Microblog Posts”
    This page contains the dataset that was created for the WSDM 2012 paper Adding Semantics to Microblog Posts by Edgar Meij, Wouter Weerkamp and Maarten de Rijke. In the paper, we evaluate various methods for automatically identifying concepts (in the form of Wikipedia articles) that are contained in or meant by a tweet.
  • Web FAQ collection
    We have made available for research purposes the Web FAQ data described in V. Jijkoun and M. de Rijke. Retrieving Answers from Frequently Asked Question Pages on the Web, CIKM 2005.
  • Weblog Post Moods
    A collection of weblog posts from LiveJournal, with the original mood annotations.
  • xtas, the eXtensible Text Analysis Suite
    The eXtensible Text Analysis Suite (xtas) provides NLP functionality such as named-entity recognition, parsing, document clustering and topic models, through Python (synchronous/asynchronous) and REST APIs.

Older material: