Language identification on Twitter data is a challenging task. In this paper, we train TextCat on a set of English, German, French, Dutch, and Spanish tweets and show that retraining on tweets improves accuracy substantially: up to 95% on English, compared to 88% for a model trained on non-Twitter data.
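For readers unfamiliar with TextCat, it follows the character n-gram rank-order approach of Cavnar and Trenkle. The Python sketch below illustrates that general idea only; it is not TextCat's code and does not read the model files offered here. The function names, the `top_k` and `max_n` values, and the two toy training sentences are assumptions made for illustration.

```python
from collections import Counter


def ngram_profile(text, max_n=3, top_k=300):
    """Return the top_k most frequent character n-grams, most frequent first."""
    counts = Counter()
    padded = f" {text.lower()} "  # pad so word boundaries become n-grams too
    for n in range(1, max_n + 1):
        for i in range(len(padded) - n + 1):
            counts[padded[i:i + n]] += 1
    # Sort by descending frequency, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top_k]]


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams missing from lang_profile get a fixed penalty."""
    rank = {gram: r for r, gram in enumerate(lang_profile)}
    max_penalty = len(lang_profile)
    return sum(
        abs(r - rank[gram]) if gram in rank else max_penalty
        for r, gram in enumerate(doc_profile)
    )


def identify(text, lang_profiles):
    """Return the language whose profile is closest to the text's profile."""
    doc = ngram_profile(text)
    return min(lang_profiles, key=lambda lang: out_of_place(doc, lang_profiles[lang]))


# Toy training texts (hypothetical; real profiles need far more data).
samples = {
    "en": "the quick brown fox jumps over the lazy dog and the cat sat on the mat",
    "nl": "de snelle bruine vos springt over de luie hond en de kat zit op de mat",
}
profiles = {lang: ngram_profile(text) for lang, text in samples.items()}
print(identify("the dog and the cat", profiles))
```

Retraining, in this picture, amounts to rebuilding the per-language profiles from tweets rather than from edited text, so that the ranked n-grams reflect how people actually write on Twitter.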
Below you can download two resources. First, we make available our Twitter-trained models for TextCat, which can be used for reliable language identification in five major Twitter languages. Second, we offer training data in the form of tweet ID-language pairs. Note that for privacy reasons we cannot offer the tweets’ contents. The ID list allows you to reconstruct the training data, although some tweets may have become unavailable in the meantime (account deleted, tweet deleted, account made private, etc.).
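Reconstructing the training data from the ID list could look roughly like the Python sketch below. It assumes one tab-separated `tweetID<TAB>language` pair per line, which may not match the actual file layout, and it takes a caller-supplied `fetch_text` function that returns a tweet's text or `None` when the tweet is no longer available. The function names, the tweet IDs, and the tweet texts in the example are all hypothetical.

```python
import csv
import io


def read_id_language_pairs(handle):
    """Yield (tweet_id, language) pairs from a tab-separated file object."""
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 2 or not row[0].strip():
            continue  # skip blank or malformed lines
        yield row[0].strip(), row[1].strip()


def rebuild_training_data(pairs, fetch_text):
    """Pair each retrievable tweet's text with its label; collect IDs that are gone."""
    examples, missing = [], []
    for tweet_id, language in pairs:
        text = fetch_text(tweet_id)
        if text is None:
            missing.append(tweet_id)  # deleted tweet, deleted or private account
        else:
            examples.append((text, language))
    return examples, missing


# Hypothetical ID list and a dictionary standing in for a real tweet lookup.
id_list = io.StringIO("1001\ten\n1002\tnl\n1003\tes\n")
available = {"1001": "good morning everyone", "1003": "buenos dias a todos"}

examples, missing = rebuild_training_data(
    read_id_language_pairs(id_list), available.get
)
print(len(examples), "tweets recovered;", len(missing), "unavailable")
```

Keeping the unavailable IDs in a separate list makes it easy to report how much of the original data set could still be recovered at the time of retrieval.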
You can use either resource for your own work, but please cite our paper when you do so: Simon Carter, Wouter Weerkamp, and Manos Tsagkias: Microblog Language Identification: Overcoming the Limitations of Short, Unedited and Idiomatic Text. In Language Resources and Evaluation Journal, 2012 (bibtex).
|Resource|Size|
|Trained TextCat Models|6.86 KB|
|Development, Training, and Test set for Language Identification on Twitter|28.1 KB|