Result Disambiguation in Web People Search: Data and ground truth

This page contains the data and ground truth used in this paper.

The data is a crawl, performed in April 2011, of search results for thirty-three person name queries, obtained from various search engines.

The ground truth was manually created. It is published here in the format used in the Web People Search (WePS) campaigns organized in 2007, 2009 and 2010.

Data
http://ilps.science.uva.nl/datafiles/dutch_people_search_corpus.tgz (622M)
Ground truth
http://ilps.science.uva.nl/datafiles/dutch_people_search_ground_truth.tgz (252K)

If you want to use this data, please cite this paper.

If you have questions about this data, contact Richard at r dot w dot berendsen at uva dot nl

Some practical notes

If you encounter a problem while using the corpus, do not hesitate to contact us. That way, we can extend the notes below.

Locating the web pages
The pages were retrieved with the program wget. Each document for each query has its own directory. In that directory, the location of the main page that was downloaded is logged in the file wget_log. Look at the line that contains “Saving to:”.
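The lookup described above can be sketched in Python. This is a minimal sketch under stated assumptions, not a tool shipped with the corpus: the function names are hypothetical, the quoting styles matched are the ones wget is known to have used across versions (`name' and ‘name’), the first “Saving to:” line is assumed to be the main page, and the logged path is assumed to be relative to the document directory. Check these assumptions against a few logs before relying on the result.

```python
import re
from pathlib import Path
from typing import Optional

# wget quotes the saved file name after "Saving to:". Depending on version
# and locale the quotes are `name' or ‘name’ (assumption: one of these two).
SAVING_TO = re.compile(r"Saving to:\s*[`'‘](.+?)['’]\s*$")


def main_page_from_log(log_text: str) -> Optional[str]:
    """Return the path on the first 'Saving to:' line, or None if absent.

    Assumption: if the log has several such lines, the first one
    belongs to the main page.
    """
    for line in log_text.splitlines():
        match = SAVING_TO.search(line)
        if match:
            return match.group(1)
    return None


def main_page_for_document(doc_dir: Path) -> Optional[Path]:
    """Read doc_dir/wget_log and return the logged main page location.

    Assumption: the logged path is relative to doc_dir. If the logs
    record paths relative to some other directory, adjust the join.
    """
    log_file = doc_dir / "wget_log"
    text = log_file.read_text(encoding="utf-8", errors="replace")
    logged = main_page_from_log(text)
    return doc_dir / logged if logged is not None else None
```

Calling main_page_for_document on a document directory returns a path to check by hand; a None result means the log has no “Saving to:” line in either quoting style.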

As a side note, pages from some social media platforms, such as Facebook, were first retrieved with an Ajax scraper called Crowbar. Then the “crowbarred” pages were in turn retrieved with wget. This was done because without Ajax these platforms do not show any content.

Viewing pages in a browser
The collection was created with the aim of retrieving complete web pages for offline use. There are two things to keep in mind when viewing the pages in a browser. First, please use a browser in offline mode, e.g. Firefox. This way, you see the pages in the same way your algorithms do, regardless of when you look at the files: the pages load no new content. Second, some sites hide their content by executing JavaScript. If you find you are looking at a blank page, try disabling JavaScript.