Concept Selection Benchmarks for Video Retrieval

This page presents two benchmarks designed for assessing automatic concept selection algorithms in concept-based video retrieval.

The benchmarks allow concept selection to be assessed independently of detector performance. In practice, they can help decide where to threshold the number of selected concepts and how to rank concepts, and they allow concept selection methods to be compared. Each benchmark consists of a set of queries, with each query mapped to concepts from a lexicon of 450 concepts using either human or collection knowledge (depending on the benchmark). A candidate concept selection algorithm can be assessed by measuring the similarity between its selected concepts and those in the benchmark, using the evaluation script provided in the download section below.
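To make the idea of comparing a selected concept set against a benchmark concrete, here is a minimal Python sketch. It is not the provided Perl evaluation script, and this page does not state which similarity measure that script uses. The set-overlap measures below (precision, recall, Jaccard) are an assumption chosen for illustration, and the function name and concept names are hypothetical.

```python
def overlap_scores(selected, benchmark):
    """Compare a set of selected concepts with a benchmark concept set.

    Assumption: plain set overlap; the real evaluation script may use
    a different measure.
    """
    selected, benchmark = set(selected), set(benchmark)
    shared = selected & benchmark
    union = selected | benchmark
    precision = len(shared) / len(selected) if selected else 0.0
    recall = len(shared) / len(benchmark) if benchmark else 0.0
    jaccard = len(shared) / len(union) if union else 0.0
    return {"precision": precision, "recall": recall, "jaccard": jaccard}


# Hypothetical concept names for a single query, for illustration only.
scores = overlap_scores(
    selected=["boat", "waterscape", "sky"],
    benchmark=["boat", "waterscape", "harbor", "ship"],
)
print(scores)
```

Scores like these, computed at different cut-offs on the number of selected concepts, are one way to explore where to threshold a selection method.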

Benchmark Descriptions

The Human Benchmark

People can have a wide range of associations with a concept, depending on context and personal characteristics. Nevertheless, there exists a common understanding of concepts, and the goal of the human-generated benchmark was to capture this common understanding rather than the wider range of individual associations. To that end, two focus group experiments were used to identify visual concepts that humans would consider useful for answering a query.

The human benchmark consists of 122 topics manually linked to related semantic concepts. Reliability was assessed by analyzing agreement between the two focus groups over 30 overlapping topics. The groups agreed on which concept was the most appropriate for a topic in 80% of the cases.
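The agreement figure above can be read as the fraction of overlapping topics for which both groups chose the same top concept. A minimal Python sketch of that calculation follows. The function name, topic identifiers, and concept names are hypothetical, and the data is invented purely to show the arithmetic; it is not taken from the benchmark.

```python
def top_concept_agreement(group_a, group_b):
    """Fraction of shared topics where both groups picked the same top concept.

    Each argument maps a topic identifier to that group's most
    appropriate concept for the topic.
    """
    shared_topics = set(group_a) & set(group_b)
    if not shared_topics:
        raise ValueError("the two groups have no topics in common")
    matches = sum(1 for topic in shared_topics if group_a[topic] == group_b[topic])
    return matches / len(shared_topics)


# Invented example data: five overlapping topics, four matching choices.
group_a = {"t1": "boat", "t2": "face", "t3": "road", "t4": "sky", "t5": "crowd"}
group_b = {"t1": "boat", "t2": "face", "t3": "car", "t4": "sky", "t5": "crowd"}
agreement = top_concept_agreement(group_a, group_b)
print(agreement)
```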

The Collection Benchmark

Labeling video collections with truth judgments for hundreds of concepts and tens of topics requires large-scale annotation efforts. Once annotations are completed, they can be used to deduce which concepts are relevant to a topic according to the collection-specific concept distribution. The truth annotations from the TRECVID benchmark, the LSCOM effort, and the MediaMill Challenge were used to back-generate relevant concepts from the TRECVID 2005 development collection. The collection benchmark consists of 56 topics automatically linked to related semantic concepts.
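As a rough illustration of deducing relevant concepts from annotations, the Python sketch below ranks concepts by how much more often they occur in a topic's relevant shots than in the collection as a whole. This page does not say which association measure was used to build the collection benchmark, so the ratio used here (often called lift) is an assumption, and the function name, shot identifiers, and concept names are hypothetical.

```python
from collections import Counter


def rank_concepts_for_topic(shot_concepts, relevant_shots):
    """Rank concepts by P(concept | relevant shot) / P(concept).

    shot_concepts maps a shot identifier to the set of concepts
    annotated as present in that shot. relevant_shots lists the shots
    judged relevant to one topic. Assumption: lift is only a stand-in
    for whatever measure the benchmark actually used.
    """
    total = len(shot_concepts)
    relevant = [s for s in set(relevant_shots) if s in shot_concepts]
    if not total or not relevant:
        return []
    overall = Counter(c for concepts in shot_concepts.values() for c in concepts)
    in_relevant = Counter(c for s in relevant for c in shot_concepts[s])
    scored = [
        (concept, (count / len(relevant)) / (overall[concept] / total))
        for concept, count in in_relevant.items()
    ]
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scored, key=lambda item: (-item[1], item[0]))


# Invented annotations for four shots, two of them relevant to the topic.
shots = {
    "s1": {"boat", "waterscape"},
    "s2": {"boat", "sky"},
    "s3": {"sky", "face"},
    "s4": {"face"},
}
ranking = rank_concepts_for_topic(shots, ["s1", "s2"])
print(ranking)
```

Concepts that never appear in a relevant shot are left out of the ranking entirely, which keeps the output limited to candidates with at least some support in the annotations.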


Download

The benchmark data sets and evaluation script can be downloaded here:

Instructions for use are in the README file. A working Perl installation is required to run the evaluation script. If you have any questions, please contact Bouke Huurnink, bhuurnink (at) uva (dot) nl.

When using resources from this page, please cite the following paper: B. Huurnink, K. Hofmann, and M. de Rijke. Assessing Concept Selection for Video Retrieval. In: ACM International Conference on Multimedia Information Retrieval (MIR 2008), October 2008. [PDF, BIB]