Web FAQ collection

The Web FAQ data described in V. Jijkoun and M. de Rijke, "Retrieving Answers from Frequently Asked Question Pages on the Web", CIKM 2005 [PDF], is available for research purposes.

Web FAQ data

  • 100 questions taken from MetaCrawler logs and manually annotated with question type (procedural, factoid, description, explanation, non-question, definition, direction, other).
    Format: id TAB question TAB type
  • List of 405,197 URLs of FAQ pages, collected by querying Google with
    the query inurl:faq
    Download (gz, 4.5M)
  • The collection of 293,031 FAQ pages downloaded from the URLs above
    See User Agreement below.
  • The collection of 2,824,179 question/answer pairs (XML format) automatically extracted from the downloaded FAQ pages
    See User Agreement below.
  • The test collection for evaluating Q/A pair extraction: original FAQ files taken from the Web, together with manually extracted Q/A pairs.
    Download (tar.gz, 1.2M)
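
The annotated question file in the first item is described as one record per line in the form id TAB question TAB type. A minimal sketch of reading that format in Python might look like the following. The names AnnotatedQuestion and parse_questions are hypothetical, and details such as the file encoding, the exact spelling of the type labels, and the absence of a header line are assumptions rather than documented properties of the data.

```python
from collections import Counter
from typing import Iterable, Iterator, NamedTuple


class AnnotatedQuestion(NamedTuple):
    """One record of the (assumed) three-column, tab-separated file."""
    qid: str
    text: str
    qtype: str


def parse_questions(lines: Iterable[str]) -> Iterator[AnnotatedQuestion]:
    """Yield one AnnotatedQuestion per non-empty line.

    Assumes exactly three tab-separated fields per line
    (id, question, type); raises ValueError otherwise.
    """
    for line_no, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) != 3:
            raise ValueError(
                f"line {line_no}: expected 3 tab-separated fields, "
                f"got {len(fields)}"
            )
        qid, text, qtype = (field.strip() for field in fields)
        yield AnnotatedQuestion(qid, text, qtype)


if __name__ == "__main__":
    # Made-up sample lines for illustration only; not taken from the data.
    sample = [
        "1\thow do i reset a router\tprocedural\n",
        "2\tcapital of peru\tfactoid\n",
    ]
    questions = list(parse_questions(sample))
    # Count how many questions carry each type label.
    print(Counter(q.qtype for q in questions))
```

To read the real file, the same function can be passed an open file object, for example parse_questions(open(path, encoding="utf-8")), where path and the encoding are again assumptions.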

User Agreement

Because of possible copyright issues with the crawled Web pages, we ask you to complete and sign the User Agreement and send it to us by fax.
Once we have received the signed Agreement, we will send you the download instructions by e-mail.

If you have questions, please contact Valentin Jijkoun.

Last modified: July 14, 2006