A set of Arabic weblog data, collected by Woiyl Hammoumi, is available from this page.
Arabic blogs dataset
The Arabic blogs dataset consists of approximately 12,000 Arabic blogs containing more than 120,300 posts in total. The oldest blog post in the dataset dates back to 2002. For each blog, the dataset records information such as the title, description, and URL. For each blog post, it records the author's name, the date of publication, and the content of the post.
The blogs in the dataset come from four blogging platforms. Two of them are popular Arabic blogging platforms: Maktoob (www.maktoobblog.com) and Jeeran (www.jeeran.com). The other two, Blogger (www.blogger.com) and Microsoft Network spaces (spaces.live.com), are not Arabic platforms, but they allow users to create and manage blogs in Arabic.
The blogs dataset has posts written in Arabic and in other languages: 62% of the posts are written in Arabic, followed by 25% in English and 4% in French.
- Download: Arabic_blogs_dataset.zip (ZIP file containing 11886 XML files, 170 MB)
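Because the XML schema of the files is not described on this page, a reasonable first step after downloading is to inspect the archive without assuming any element names. The following is a minimal sketch in Python (standard library only) that counts the XML files in the ZIP archive and tallies their root element tags. The function name `summarize_xml_archive` and the `<unparseable>` label are illustrative choices, not part of the dataset.

```python
import zipfile
import xml.etree.ElementTree as ET
from collections import Counter


def summarize_xml_archive(zip_source):
    """Return (number of XML files, Counter of root element tags).

    zip_source may be a path or a binary file-like object.
    No schema is assumed: only the root tag of each file is read.
    """
    root_tags = Counter()
    xml_count = 0
    with zipfile.ZipFile(zip_source) as archive:
        for name in archive.namelist():
            if not name.lower().endswith(".xml"):
                continue  # skip directories and non-XML members
            xml_count += 1
            with archive.open(name) as handle:
                try:
                    root = ET.parse(handle).getroot()
                except ET.ParseError:
                    # Count malformed files instead of stopping the scan.
                    root_tags["<unparseable>"] += 1
                    continue
            root_tags[root.tag] += 1
    return xml_count, root_tags


# Example call, assuming the archive sits in the current directory:
# count, tags = summarize_xml_archive("Arabic_blogs_dataset.zip")
# print(count, tags.most_common(5))
```

If the download is intact, the file count should match the 11886 XML files stated above; the root tags then indicate which element names to look for when extracting titles, authors, dates, and post content.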
Collection of Arabic blog posts
The collection of Arabic blog posts consists of 71,674 blog posts in Arabic. It was extracted from the Arabic blogs dataset in February 2007. The blog posts in the collection are stored as plain text with UTF-8 encoding. Some of them were used in the experiments that we performed on the Arabic blogs.
- Download: Arabic_blog_posts.tgz (tar.gzip archive containing 71674 text files, UTF-8 encoded, 79 MB)
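The posts can be read directly from the archive without unpacking all 71674 files to disk. The following is a minimal sketch in Python (standard library only) that streams each text file and decodes it as UTF-8. The function name `iter_posts` is illustrative, and the internal directory layout of the archive is not assumed: every regular file is treated as one post.

```python
import tarfile


def iter_posts(tgz_path):
    """Yield (member name, decoded text) for each regular file in the archive."""
    with tarfile.open(tgz_path, mode="r:gz") as archive:
        for member in archive:
            if not member.isfile():
                continue  # skip directories and links
            extracted = archive.extractfile(member)
            if extracted is None:
                continue
            # The page states the files are UTF-8; invalid bytes, if any,
            # are replaced rather than raising an error.
            yield member.name, extracted.read().decode("utf-8", errors="replace")


# Example call, assuming the archive sits in the current directory:
# for name, text in iter_posts("Arabic_blog_posts.tgz"):
#     print(name, len(text))
#     break
```

Streaming keeps memory use low, since only one post is held at a time.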