This page contains the data and ground truth used in this paper.
The data is a crawl, performed in April 2011, of search results for thirty-three person name queries, obtained from various search engines.
The ground truth was manually created. It is published here in the format used in the Web People Search (WePS) campaigns organized in 2007, 2009 and 2010.
- Data
  - http://ilps.science.uva.nl/datafiles/dutch_people_search_corpus.tgz (622M)
- Ground truth
  - http://ilps.science.uva.nl/datafiles/dutch_people_search_ground_truth.tgz (252K)
If you want to use this data, please cite this paper.
If you have questions about this data, contact Richard at r dot w dot berendsen at uva dot nl
Some practical notes
If you encounter a problem while using the corpus, do not hesitate to contact us. That way, we can extend the notes below.
Locating the web pages
The pages were retrieved with the program wget. Each document for each query has its own directory. In that directory, the location of the main page that was downloaded is recorded in the wget log file: look for the line that contains “Saving to:”.
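The lookup described above can be sketched in Python. This is only an illustration under some assumptions: the function names and the `log_name` parameter are hypothetical (pass whatever the log file is called in your copy of the corpus), the quote characters around the file name are assumed to vary with the wget version, and the logged path is assumed to be relative to the document directory.

```python
from pathlib import Path
from typing import Optional

# Quote characters that wget may put around the saved file name.
# Assumption: the exact characters depend on the wget version and locale.
_QUOTES = "`'\"‘’“”"

_MARKER = "Saving to:"


def saved_path_from_log(log_text: str) -> Optional[str]:
    """Return the path on the first 'Saving to:' line, or None if absent."""
    for line in log_text.splitlines():
        idx = line.find(_MARKER)
        if idx != -1:
            # Keep what follows the marker, minus whitespace and quotes.
            return line[idx + len(_MARKER):].strip().strip(_QUOTES)
    return None


def main_page_for(doc_dir: Path, log_name: str) -> Optional[Path]:
    """Locate the main page of one document directory.

    `log_name` is a hypothetical parameter: the name of the log file
    inside `doc_dir`. The logged path is assumed to be relative to
    `doc_dir`; adjust if your copy of the corpus differs.
    """
    text = (doc_dir / log_name).read_text(encoding="utf-8", errors="replace")
    saved = saved_path_from_log(text)
    return None if saved is None else doc_dir / saved
```

Only the first matching line is used, on the assumption that the main page is the first file wget saves for a document. If a log turns out to contain several such lines, it is worth checking them by hand.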
As a side note, pages from some social media platforms, such as Facebook, were first retrieved with an Ajax scraper called Crowbar. Then the “crowbarred” pages were in turn retrieved with wget. This was done because without Ajax these platforms do not show any content.
Viewing pages in a browser