EuroGOV
EuroGOV, the document collection used at WebCLEF 2005, consists of web documents crawled from European governmental sites. Pages are included from the following top-level domains:
at, be, cy, cz, de, dk, ee, es, fi, fr, gr, hu, ie, int, it, lt, lu, lv, mt, nl, pl, pt, ru, se, si, sk, uk
Access
Access to the collection is restricted to individuals and organizations that have filled out, signed, and sent us the appropriate end user agreements:
Upon receipt of the completed and signed organisational agreement, we will make available a login name and password that will enable participants to download the EuroGOV collection.
Collection format
The EuroGOV collection is split up into directories, one for each top-level domain. That is, the directory structure looks like this:
(user@pc) ls
at cy de ee eu fr hu it lu mt pl ru si uk
be cz dk es fi gr ie lt lv nl pt se sk
(user@pc)
Each directory contains one or more gzipped files. Each file contains up to 25,000 documents. Directories look like this:
(user@pc) ls s*/*
se/001.gz se/003.gz se/005.gz sk/001.gz sk/003.gz
se/002.gz se/004.gz si/001.gz sk/002.gz
(user@pc)
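As an illustration of the layout described above, here is a minimal Python sketch that walks the per-domain directories and decompresses one file. The function names `iter_bin_files` and `read_bin_text` are hypothetical helpers, not part of the distribution. The `latin-1` decoding is an assumption chosen only because it never fails; the pages themselves may use other encodings.

```python
import gzip
from pathlib import Path
from typing import Iterator, Tuple


def iter_bin_files(root: str) -> Iterator[Tuple[str, Path]]:
    """Yield (domain, path) for every gzipped file below the collection root."""
    for domain_dir in sorted(Path(root).iterdir()):
        if not domain_dir.is_dir():
            continue  # skip anything that is not a per-domain directory
        for gz_path in sorted(domain_dir.glob("*.gz")):
            yield domain_dir.name, gz_path


def read_bin_text(gz_path: Path, encoding: str = "latin-1") -> str:
    """Decompress one file and decode it (the encoding is an assumption)."""
    with gzip.open(gz_path, "rb") as handle:
        return handle.read().decode(encoding)
```

Calling `iter_bin_files` on the collection root would yield pairs such as `("se", Path(".../se/001.gz"))`, in sorted order.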
Each file has the following format:
<EuroGOV:bin
domain="" <!-- The top level domain -->
id=""> <!-- The name of the file -->
<EuroGOV:doc
url="" <!-- URL of the page -->
id="" <!-- DocID of the format Exx-yyy-z -->
<!-- E is the literal letter E (for EuroGOV) -->
<!-- xx is the top level domain -->
<!-- yyy is the file name -->
<!-- z is the character offset of the document -->
md5="" <!-- MD5 checksum of the content of the page -->
fetchDate="" <!-- Fetch date of the page -->
contentType=""> <!-- contentType as given by the web server -->
<EuroGOV:content>
<![CDATA[
... content ... <!-- This is the actual page -->
]]>
</EuroGOV:content>
</EuroGOV:doc>
...
</EuroGOV:bin>
This may look like XML, but it is not well-formed XML, because:
- We do not escape ampersands in URLs.
- Some documents contain the pattern <![CDATA[... ]]> in their content, and nested CDATA sections are not allowed in XML.
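Because of these two points, a standard XML parser is likely to reject the files. One possible workaround is to extract documents with a regular expression instead. The sketch below is not an official reader: `DOC_RE` and `iter_docs` are hypothetical names, and it assumes that the attributes always appear in the order shown in the format description and that the input is already decompressed and decoded text.

```python
import re
from typing import Dict, Iterator

# Assumes the attribute order url, id, md5, fetchDate, contentType.
# The lazy content group only stops at a "]]>" that is directly followed
# by the closing EuroGOV tags, so a "]]>" inside the page is kept.
DOC_RE = re.compile(
    r'<EuroGOV:doc\s+'
    r'url="(?P<url>[^"]*)"\s+'
    r'id="(?P<id>[^"]*)"\s+'
    r'md5="(?P<md5>[^"]*)"\s+'
    r'fetchDate="(?P<fetchDate>[^"]*)"\s+'
    r'contentType="(?P<contentType>[^"]*)">\s*'
    r'<EuroGOV:content>\s*<!\[CDATA\[(?P<content>.*?)\]\]>\s*'
    r'</EuroGOV:content>\s*</EuroGOV:doc>',
    re.DOTALL,
)


def iter_docs(bin_text: str) -> Iterator[Dict[str, str]]:
    """Yield one dict per document with its attributes and raw content."""
    for match in DOC_RE.finditer(bin_text):
        yield match.groupdict()
```

URLs are returned exactly as stored, so an unescaped `&` needs no special handling here.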
Here are the first 20 lines of se/001.gz:
(user@pc) head se/001 -n 20
<EuroGOV:bin domain="se" id="001">
<EuroGOV:doc
url="http://www.regeringen.se/"
id="Ese-001-35"
md5="659b462005b40f04bde5946b2beaad71"
fetchDate="Wed Sep 22 10:57:39 MEST 2004"
contentType="text/html">
<EuroGOV:content>
<![CDATA[
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html lang="sv">
<head>
<title>Regeringen och Regeringskansliet</title>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<meta http-equiv="Content-Script-Type" content="text/javascript">
<meta http-equiv="Content-Style-Type" content="text/css">
<script language="javascript" type="text/javascript" src="/js/popup.js"></script>
<script language="javascript" type="text/javascript" src="/js/validationTexts_sv.js"></script>
<script language="javascript" type="text/javascript" src="/js/formFunctions.js"></script>
<link rel="stylesheet" type="text/css" href="/css/deprecatedstyle.css">
(user@pc)
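The DocID in this sample, Ese-001-35, follows the Exx-yyy-z pattern from the format description. As a hedged illustration, the sketch below splits such an identifier into its parts. `DocID` and `parse_docid` are hypothetical names, and the regular expression assumes that the domain consists of lowercase letters and that the file name contains no hyphen.

```python
import re
from typing import NamedTuple


class DocID(NamedTuple):
    domain: str      # xx: the top-level domain
    file_name: str   # yyy: the file name
    offset: int      # z: the character offset of the document

    @property
    def bin_path(self) -> str:
        """Relative path of the gzipped file that holds the document."""
        return f"{self.domain}/{self.file_name}.gz"


_DOCID_RE = re.compile(r"E(?P<domain>[a-z]+)-(?P<file>[^-]+)-(?P<offset>\d+)")


def parse_docid(docid: str) -> DocID:
    """Split a DocID of the form Exx-yyy-z into its three components."""
    match = _DOCID_RE.fullmatch(docid)
    if match is None:
        raise ValueError(f"not a EuroGOV DocID: {docid!r}")
    return DocID(match["domain"], match["file"], int(match["offset"]))
```

For the sample, `parse_docid("Ese-001-35")` gives the domain `se`, the file name `001`, and the offset 35. The first line of the sample is 34 characters plus a newline, so the offset appears to be a zero-based position of the opening `<EuroGOV:doc` tag. That reading is inferred from this one example and is not confirmed elsewhere.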
Furthermore, the following is distributed with EuroGOV:
- A file that reports content duplicates. This is simply a mapping between the page-ids of pages that have the same MD5 checksum.
- A file with the output of a language detection program. This is a mapping from page-ids to sets of language ids; that is, some pages are assigned multiple languages. This happens when the language identification script is not sure.
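The layout of the duplicates file is not specified here, so the following sketch only illustrates the idea behind it: grouping page-ids that share an MD5 checksum. The function name `group_duplicates` and the (page-id, md5) input pairs are assumptions made for illustration; in practice the pairs could come from the id and md5 attributes of each document.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def group_duplicates(pairs: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Map each MD5 checksum shared by several pages to their page-ids."""
    by_md5: Dict[str, List[str]] = defaultdict(list)
    for page_id, md5 in pairs:
        by_md5[md5].append(page_id)
    # Keep only checksums that occur for more than one page.
    return {md5: ids for md5, ids in by_md5.items() if len(ids) > 1}
```

Pages whose checksum is unique are left out of the result, so the returned mapping contains only groups of content duplicates.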