DBpedia annotations and features (ISWC 2009, JWS 2010)

This page contains the data files used in the ISWC 2009 paper “Learning Semantic Query Suggestions” by E. Meij, M. Bron, B. Huurnink, L. Hollink, and M. de Rijke. The abstract of the paper:

An important application of semantic web technology is recognizing human-defined concepts in text. Query transformation is a strategy often used in search engines to derive queries that are able to return more useful search results than the original query and most popular search engines provide facilities that let users complete, specify, or reformulate their queries. We study the problem of semantic query suggestion, a special type of query transformation based on identifying semantic concepts contained in user queries. We use a feature-based approach in conjunction with supervised machine learning, augmenting term-based features with search history-based and concept-specific features. We apply our method to the task of linking queries from real-world query logs (the transaction logs of the Netherlands Institute for Sound and Vision) to the DBpedia knowledge base. We evaluate the utility of different machine learning algorithms, features, and feature types in identifying semantic concepts using a manually developed test bed and show significant improvements over an already high baseline.

The download consists of two parts. The first is the set of annotations created by the annotators (queryDB.zip). This file is tab-separated, with one annotation per line. Each line includes the following fields:

  • query id: the ID of the query
  • query: the content of the query
  • session id: the ID of the session in which this query was issued
  • wikipedia id: the concept ID
  • wikipedia title: the concept label
  • ambiguous (Boolean): true if the annotator thought this query was ambiguous. The annotators were asked to find all possible concepts if the ambiguity was limited.
  • typo (Boolean): true if the annotator thought this query contained a typo
  • unknown (Boolean): true if the annotator had no idea what the query meant
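
The field list above can be read with a few lines of Python. The following is a minimal sketch, not the authors' own tooling. It assumes that the file has no header row, that the columns appear in the order listed above, and that the Boolean fields are written as strings such as "true" or "1". None of these details are stated on this page, so check them against the actual file. The names `FIELDS`, `read_annotations`, and `concepts_by_query` are hypothetical.

```python
import csv
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List

# Column order as listed on this page (assumed to match the file layout).
FIELDS = [
    "query_id", "query", "session_id",
    "wikipedia_id", "wikipedia_title",
    "ambiguous", "typo", "unknown",
]
BOOLEAN_FIELDS = ("ambiguous", "typo", "unknown")

# Assumption: the exact Boolean encoding is not documented, so accept a few spellings.
TRUE_VALUES = {"true", "1", "yes", "t"}


def read_annotations(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per tab-separated annotation line."""
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != len(FIELDS):
            # Skip lines that do not have exactly the eight expected columns.
            continue
        record = dict(zip(FIELDS, row))
        for name in BOOLEAN_FIELDS:
            record[name] = record[name].strip().lower() in TRUE_VALUES
        yield record


def concepts_by_query(records: Iterable[dict]) -> Dict[str, List[str]]:
    """Group concept IDs by query ID; a query may be linked to several concepts."""
    grouped = defaultdict(list)
    for record in records:
        if record["wikipedia_id"]:
            grouped[record["query_id"]].append(record["wikipedia_id"])
    return dict(grouped)
```

To use it, open the extracted annotation file in text mode and pass the file object to `read_annotations`. Because each line holds one annotation, a query marked as ambiguous can appear on several lines, which is why `concepts_by_query` collects a list per query.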

All Wikipedia/DBpedia IDs can be found in the (Dutch) DBpedia version that we used (3.2). The Wikipedia dump from which this DBpedia version was created has dump date 20080609.

The second part of the download contains the extracted features for each query-concept pair in ARFF format. The paper describes what each individual feature means and how it was calculated.
There are two files here. The first (iswc09-ngram-features.arff.zip) contains the features for all the n-grams in the queries. The second (iswc09-wholequery-features.arff.zip) contains the features for the entire query, as detailed in the paper.
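
ARFF is the plain-text format used by the Weka toolkit: a header of `@attribute` declarations followed by a `@data` section of comma-separated rows. The sketch below shows one way to read such a file with only the Python standard library. It is a simplified reader, not a full ARFF implementation. It assumes dense (non-sparse) rows and single-quoted values, and it keeps every value as a string apart from mapping the missing-value marker `?` to `None`. The function name `read_arff` is hypothetical, and the attribute names in the feature files are not listed on this page.

```python
import csv
from typing import List, Optional, Tuple


def _attribute_name(declaration: str) -> Optional[str]:
    """Extract the name from the text following '@attribute'."""
    rest = declaration.strip()
    if not rest:
        return None
    if rest[0] in "'\"":
        # Quoted name, which may contain spaces.
        end = rest.find(rest[0], 1)
        return rest[1:end] if end != -1 else rest[1:]
    return rest.split(None, 1)[0]


def read_arff(text: str) -> Tuple[List[str], List[dict]]:
    """Parse dense ARFF text into (attribute names, list of row dicts)."""
    attributes: List[str] = []
    data_lines: List[str] = []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # blank line or ARFF comment
        if in_data:
            data_lines.append(line)
        elif line.lower().startswith("@attribute"):
            name = _attribute_name(line[len("@attribute"):])
            if name is not None:
                attributes.append(name)
        elif line.lower().startswith("@data"):
            in_data = True
    reader = csv.reader(data_lines, quotechar="'", skipinitialspace=True)
    rows = [
        {name: (None if value == "?" else value)
         for name, value in zip(attributes, row)}
        for row in reader
    ]
    return attributes, rows
```

After unzipping a feature file, read it as text and pass the contents to `read_arff`. Numeric features can then be converted with `float()` where needed. For anything beyond a quick inspection, loading the files directly in Weka avoids the simplifications made here.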

If you use this dataset, please cite the above-mentioned paper.

For more information or if you have questions, please contact Edgar Meij.

Attachment                              Size
queryDB.zip                             20.46 KB
iswc09-ngram-features.arff_.zip         591.26 KB
iswc09-wholequery-features.arff_.zip    569.64 KB