Social networking spam

Spam detection in social networking sites is a challenging problem, mainly because of the volume of messages and the pace at which their content changes. In this paper we propose a method for spam detection based on the HITS algorithm. To evaluate our models we use a dataset from Hyves, the largest Dutch social networking site.
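For readers unfamiliar with HITS, the following is a minimal sketch of the textbook algorithm, not the framework from the paper. It assumes, purely for illustration, a graph in which each edge points from a reporter to a message that reporter flagged; the node names and the `hits` function are hypothetical.

```python
import math


def hits(edges, iterations=50):
    """Textbook HITS by power iteration over directed (source, target) edges.

    Returns (hub_scores, authority_scores), each L2-normalized.
    """
    hubs = {s: 1.0 for s, _ in edges}
    auths = {t: 1.0 for _, t in edges}
    for _ in range(iterations):
        # Authority update: sum the hub scores of all nodes pointing at the target.
        new_auths = {t: 0.0 for t in auths}
        for s, t in edges:
            new_auths[t] += hubs[s]
        norm = math.sqrt(sum(v * v for v in new_auths.values())) or 1.0
        auths = {t: v / norm for t, v in new_auths.items()}

        # Hub update: sum the authority scores of all nodes the source points at.
        new_hubs = {s: 0.0 for s in hubs}
        for s, t in edges:
            new_hubs[s] += auths[t]
        norm = math.sqrt(sum(v * v for v in new_hubs.values())) or 1.0
        hubs = {s: v / norm for s, v in new_hubs.items()}
    return hubs, auths


# Hypothetical toy graph: reporter -> reported message.
edges = [("r1", "m1"), ("r2", "m1"), ("r2", "m2")]
hubs, auths = hits(edges)
```

In this toy graph, message `m1` is reported by both reporters and therefore ends up with a higher authority score than `m2`. How such scores are turned into a spam decision is left open here.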

The (anonymized) dataset contains 1,195 messages, each annotated as spam or not-spam. For each message the dataset includes the author ID, a list of the IDs of its reporters, and the message content, in which each distinct word is assigned a term ID.
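The fields described above could be held in memory roughly as follows. This is a sketch only: the on-disk format of the archive is not described here, so the class, its field names, and the sample values are all hypothetical and no parsing is attempted.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List


@dataclass
class Message:
    """One annotated message; every field name here is an assumption."""
    author_id: int
    reporter_ids: List[int]  # IDs of the users who reported the message
    term_ids: List[int]      # message content as a sequence of term IDs
    is_spam: bool            # the spam / not-spam annotation


def reports_per_author(messages):
    """Count how many reports the messages of each author received."""
    counts = Counter()
    for message in messages:
        counts[message.author_id] += len(message.reporter_ids)
    return counts


# Made-up records for illustration; these are not taken from the dataset.
sample = [
    Message(author_id=1, reporter_ids=[10, 11], term_ids=[5, 7, 5], is_spam=True),
    Message(author_id=2, reporter_ids=[10], term_ids=[7, 9], is_spam=False),
]
report_counts = reports_per_author(sample)
```

Keeping the reporter IDs as a list per message makes it straightforward to derive reporter-to-message or reporter-to-author relations later.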

If you use this dataset, please cite the paper for which it was constructed: Maarten Bosma, Edgar Meij, and Wouter Weerkamp. A Framework for Unsupervised Spam Detection in Social Networking Sites. In ECIR 2012.

Attachment                   Size
ecir2012-spam-dataset.zip    798.23 KB