Ssscrape: a system for collecting dynamic web data

Ssscrape stands for Syndicated and Semi-Structured Content Retrieval and
Processing Environment. Ssscrape is a framework for crawling and processing
dynamic web data, such as RSS/Atom feeds.

Ssscrape is a system for tracking dynamic online collections of items: RSS
feeds, blogs, news, podcasts, etc. For a set of online data sources, a user can
configure Ssscrape to:

  • periodically check for new information items;
  • download and store (e.g., in a database) items along with available
    meta-data;
  • clean the content (e.g., producing plain text) and perform other
    application-specific processing (e.g., tagging, duplicate detection,
    linking);
  • monitor activity and report errors.
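
The "store new items" and "clean the content" steps above can be illustrated
with a minimal Python sketch. It is not Ssscrape's actual code: the table
layout, the item dictionaries with guid/title/content keys, and the function
names to_plain_text and store_new_items are all assumptions made for this
example, and it relies only on the Python standard library.

```python
import sqlite3
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def to_plain_text(html):
    """Strip markup and collapse whitespace (a stand-in for the cleaning step)."""
    extractor = _TextExtractor()
    extractor.feed(html)
    extractor.close()
    return " ".join("".join(extractor.parts).split())


def store_new_items(conn, items):
    """Store items whose guid is not in the database yet; return how many were new.

    `items` is a hypothetical list of dicts with "guid", "title" and "content".
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS items "
        "(guid TEXT PRIMARY KEY, title TEXT, body TEXT)"
    )
    new_count = 0
    for item in items:
        seen = conn.execute(
            "SELECT 1 FROM items WHERE guid = ?", (item["guid"],)
        ).fetchone()
        if seen:
            continue  # already stored on an earlier check
        conn.execute(
            "INSERT INTO items (guid, title, body) VALUES (?, ?, ?)",
            (item["guid"], item["title"], to_plain_text(item["content"])),
        )
        new_count += 1
    conn.commit()
    return new_count
```

Calling store_new_items twice with the same items stores them only once, which
is the behaviour a periodic check needs: each run picks up only what appeared
since the previous one.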

Ssscrape is flexible and easily expandable:

  • new online data sources can be added simply by specifying URLs,
    periodicity and specific processing methods;
  • new data processing methods (workers) can easily be added as scripts with a
    simple API.
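
As a rough illustration of this idea, the Python sketch below describes each
source by a URL, a check period and the name of a worker, and registers
workers as plain functions. Everything in it is hypothetical: the Source
class, the WORKERS registry, the worker decorator and due_sources are names
invented for this example and do not reflect Ssscrape's real configuration
format or worker API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

# Hypothetical registry mapping a worker name to a processing function.
WORKERS: Dict[str, Callable[[str], str]] = {}


def worker(name):
    """Register a processing function under the given name."""
    def register(func):
        WORKERS[name] = func
        return func
    return register


@worker("uppercase-title")
def uppercase_title(payload: str) -> str:
    """A trivial example worker: upper-case the text it receives."""
    return payload.upper()


@dataclass
class Source:
    url: str                              # where to fetch the data
    period_seconds: int                   # how often to check
    worker_name: str                      # which worker processes the result
    last_checked: Optional[float] = None  # timestamp of the last check, if any


def due_sources(sources: List[Source], now: float) -> List[Source]:
    """Return the sources never checked or whose period has elapsed."""
    return [
        s for s in sources
        if s.last_checked is None or now - s.last_checked >= s.period_seconds
    ]
```

Under these assumptions, adding a source means appending one Source entry, and
adding a processing method means writing one decorated function.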

The following people contributed to the design and implementation of Ssscrape:

  • Wouter Bolsterlee
  • Breyten Ernsting
  • Valentin Jijkoun
  • Fons Laan
  • Manos Tsagkias

Ssscrape is distributed under the GNU Lesser General Public License.

Download the version of Feb 2010: Ssscrape-1.0.tar.gz (212.04 KB)

Latest versions and more information: https://github.com/ssscrape/ssscrape

Questions, suggestions, comments? Contact Valentin Jijkoun: jijkoun AT uva.nl.