A Toy Corpus of SPAM in Blog Comments

A small collection of 50 blog pages, with 1024 comments; manual classifications of these comments as spam or non-spam (67% are spam). For questions, contact Gilad Mishne.

Note: by downloading the corpus you agree to the disclaimer.

If you publish results obtained using this resource, please cite this paper:

  • Blocking Blog Spam with Language Model Disagreement, G. Mishne, D. Carmel, and R. Lempel. In: AIRWeb ’05 – First International Workshop on Adversarial Information Retrieval on the Web, at the 14th International World Wide Web Conference (WWW2005), 2005.

Blog spam corpus – disclaimer


    1. The Information has been obtained by crawling the Internet. Due to the amount of Information it has not been practicable to obtain permission from copyright owners to provide the Information for the Permitted Uses.


    1. The Universiteit van Amsterdam (UvA) understands that all the documents in the Information are documents which have been at some time made publicly available on the Internet and which have been collected using a process which respects the commonly accepted methods (such as robots.txt) for indicating that the documents should not be so collected.
    2. Owners of copyright in individual documents may choose to request deletion of these documents from the Information.
    3. The limitation on permitted use contained in the following section is intended to reduce the risk of any action being brought by copyright owners, but if this happens the Organisation has agreed under its application form to bear all associated liability.


Permitted Uses

    1. The Information may only be used for non-commercial research and development of natural-language-processing, information-retrieval or document-understanding systems.
    2. Summaries, analyses and interpretations of the linguistic properties of the Information may be derived and published, provided it is not possible to reconstruct the Information from these summaries.
    3. Small excerpts of the Information may be displayed to others or published in a scientific or technical context, solely for the purpose of describing the research and development carried out and related issues.
    4. All efforts must be made not to infringe the rights of any third party including, but not limited to, the authors and publishers of any excerpts used in accordance with clause 3.


  • This collection may not be redistributed.


Agreement to Delete Data on Request

I undertake to delete within thirty days of receiving notice all copies of any nominated document that is part of the Information whenever requested to do so by any one of:

  1. UvA; or
  2. the owner of copyright for the particular document.

Access to the Information by Individuals

I understand that the Organisation has agreed to certain obligations in respect of my access in an Agreement with UvA and I agree to use the Information only for the Permitted Uses and to comply with those obligations in the Agreement as they affect me. Those obligations are that the Organisation:

  1. must control access to the Information by individuals and may only grant access to people working under its control, i.e., its own members, consultants to the Organisation, or individuals providing service to the Organisation;
  2. must ensure that before being given access I complete and submit this Individual Application form;
  3. must terminate my access when the conditions of the application no longer apply;
  4. remains responsible for any breach of the Individual Application form by me;
  5. will retain the applications of all persons ever granted access to the Information and make them available upon request to any of the copyright holders and to UvA;
  6. will maintain a list of people with current and recently-terminated access to the Information and make it available to UvA on request; and
  7. must make sure that I only display the Information to or share the Information with persons whom my Organisation lists as having access to the Information.