EuroGOV

EuroGOV, the document collection used at WebCLEF 2005, consists of web documents crawled from European governmental sites. Pages from the following top-level domains are included:

at, be, cy, cz, de, dk, ee, es, fi, fr, gr, hu, ie, int, it, lt, lu, lv, mt, nl, pl, pt, ru, se, si, sk, uk

Access

Access to the collection is restricted to individuals and organizations that have filled out, signed, and sent us the appropriate end user agreements:

Upon receipt of the completed and signed organisational agreement, we will make available a login name and password that will enable participants to download the EuroGOV collection.

Collection format

The EuroGOV collection is split up into directories, one for each top-level domain. That is, the directory structure looks like this:

(user@pc) ls
at  cy  de  ee  eu  fr  hu  it  lu  mt  pl  ru  si  uk
be  cz  dk  es  fi  gr  ie  lt  lv  nl  pt  se  sk
(user@pc)

Each directory contains one or more gzipped files. Each file contains up to 25,000 documents. Directories look like this:

(user@pc) ls s*/*
se/001.gz  se/003.gz  se/005.gz  sk/001.gz  sk/003.gz
se/002.gz  se/004.gz  si/001.gz  sk/002.gz
(user@pc)
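
As a rough sketch of how this layout could be walked from Python: the helpers below list every gzipped file under a collection root and decompress one of them. The names list_bins and read_bin, and the root path passed in, are hypothetical. The character encoding of the stored pages is not stated here, so the sketch assumes Latin-1, which never fails to decode but may not be correct for every page.

```python
import gzip
from pathlib import Path


def list_bins(root):
    """Return (domain, file name, path) for every <domain>/<name>.gz under root."""
    bins = []
    for path in sorted(Path(root).glob("*/*.gz")):
        domain = path.parent.name  # directory name, e.g. "se"
        name = path.stem           # file name without ".gz", e.g. "001"
        bins.append((domain, name, path))
    return bins


def read_bin(path, encoding="latin-1"):
    """Decompress one file and return its text (the encoding is an assumption)."""
    with gzip.open(path, "rt", encoding=encoding, newline="") as handle:
        return handle.read()
```

Each file holds up to 25,000 documents, so reading a whole file into memory is a simplification that keeps the sketch short.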

Each file has the following format:

<EuroGOV:bin
 domain=""       <!-- The top level domain -->
 id="">          <!-- The name of the file -->
<EuroGOV:doc
 url=""          <!-- URL of the page -->
 id=""           <!-- DocID of the format Exx-yyy-z -->
                 <!--  E is E and stands for EuroGOV -->
                 <!--  xx is the top level domain -->
                 <!--  yyy is the file name -->
                 <!--  z is the character offset of the document -->
 md5=""          <!-- MD5 checksum of the content of the page -->
 fetchDate=""    <!-- Fetch date of the page -->
 contentType=""> <!-- contentType as given by the web server -->
<EuroGOV:content>
<![CDATA[
... content ...  <!-- This is the actual page -->
]]>
</EuroGOV:content>
</EuroGOV:doc>
...
</EuroGOV:bin>
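
The DocID scheme Exx-yyy-z can be taken apart with a small pattern. This is a minimal sketch: parse_doc_id and DocId are hypothetical names, and the pattern assumes lowercase domain letters and numeric file names, as in the directory listings above.

```python
import re
from typing import NamedTuple


class DocId(NamedTuple):
    domain: str     # xx: the top level domain
    file_name: str  # yyy: the file name
    offset: int     # z: the character offset of the document


# Assumed shape: "E", domain letters, "-", digits, "-", digits.
_DOC_ID = re.compile(r"E([a-z]+)-(\d+)-(\d+)")


def parse_doc_id(doc_id):
    """Split a DocID such as 'Ese-001-35' into its three parts."""
    match = _DOC_ID.fullmatch(doc_id)
    if match is None:
        raise ValueError(f"not a EuroGOV DocID: {doc_id!r}")
    domain, file_name, offset = match.groups()
    return DocId(domain, file_name, int(offset))
```

For example, parse_doc_id("Ese-001-35") gives domain "se", file name "001", and offset 35.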

This might look like XML, but it is not well-formed XML, because:

  1. We do not escape ampersand in URLs.

  2. Some documents contain the pattern

    <![CDATA[... ]]>

    in their content, and CDATA sections cannot be nested in XML.
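
Since a standard XML parser is likely to reject these files, one option is to extract documents with regular expressions instead. The sketch below is an assumption about how that could be done, not the tooling used to build the collection; iter_docs is a hypothetical name. It relies on the full closing sequence ]]> </EuroGOV:content> </EuroGOV:doc> to find the end of a document, so a ]]> inside a page does not end it early, and it assumes attribute values contain no double quotes.

```python
import re

# One document: the attribute block, then everything up to the full closing
# sequence. Matching the whole closing sequence tolerates "]]>" inside a page.
_DOC = re.compile(
    r"<EuroGOV:doc\s+(?P<attrs>.*?)>\s*"
    r"<EuroGOV:content>\s*<!\[CDATA\[\n?"
    r"(?P<content>.*?)"
    r"\]\]>\s*</EuroGOV:content>\s*</EuroGOV:doc>",
    re.DOTALL,
)

# name="value" pairs; values are taken verbatim, so "&" in URLs is kept as-is.
_ATTR = re.compile(r'(\w+)="([^"]*)"')


def iter_docs(text):
    """Yield (attributes, content) for each document in a decompressed file."""
    for match in _DOC.finditer(text):
        attrs = dict(_ATTR.findall(match.group("attrs")))
        yield attrs, match.group("content")
```

The attribute dictionary carries the url, id, md5, fetchDate, and contentType values exactly as written in the file.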

Here are the first 20 lines of se/001.gz:

(user@pc) zcat se/001.gz | head -n 20
<EuroGOV:bin domain="se" id="001">
<EuroGOV:doc
 url="http://www.regeringen.se/"
 id="Ese-001-35"
 md5="659b462005b40f04bde5946b2beaad71"
 fetchDate="Wed Sep 22 10:57:39 MEST 2004"
 contentType="text/html">
<EuroGOV:content>
<![CDATA[
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html lang="sv">
<head>
    <title>Regeringen och Regeringskansliet</title>
    <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
    <meta http-equiv="Content-Script-Type" content="text/javascript">
    <meta http-equiv="Content-Style-Type" content="text/css">
    <script language="javascript" type="text/javascript" src="/js/popup.js"></script>
    <script language="javascript" type="text/javascript" src="/js/validationTexts_sv.js"></script>
    <script language="javascript" type="text/javascript" src="/js/formFunctions.js"></script>
    <link rel="stylesheet" type="text/css" href="/css/deprecatedstyle.css">
(user@pc)

Furthermore, the following is distributed with EuroGOV:

  • A file that reports content duplicates. This is simply a mapping between the page ids of pages that have the same MD5 checksum.

  • A file with the output of a language detection program. This is a mapping from page ids to language ids. Some pages are assigned more than one language; this happens when the language identification script is not sure.
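
The layout of the duplicates file is not described here, but the idea behind it, grouping page ids by MD5 checksum, can be sketched as follows. The function name duplicate_groups and the input shape (an iterable of (page id, md5) pairs, for example taken from the id and md5 attributes of each document) are assumptions for illustration.

```python
from collections import defaultdict


def duplicate_groups(id_md5_pairs):
    """Group page ids that share an MD5 checksum; keep groups of two or more."""
    by_md5 = defaultdict(list)
    for doc_id, md5 in id_md5_pairs:
        by_md5[md5].append(doc_id)
    return [ids for ids in by_md5.values() if len(ids) > 1]
```

Pages whose checksum is unique do not appear in the result.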