MataHari: Machine Translation with Harvested Internet Resources

Duration: 2010-2012

Summary: The main objective of the proposed research is to take
the first step toward a machine translation framework that achieves
truly global translation capabilities by covering a large number of
languages. To this end, this project will investigate a number of
languages that existing research has so far covered only to a small
extent, or not at all.

The methods investigated in this project fall under the paradigm of
statistical machine translation, which takes a parallel corpus, i.e.,
documents that have been translated by a professional translator, and
automatically learns translation rules from this set of documents.
As the proposed project focuses on languages that have not been
covered so far to a large extent, it has to address novel challenges
and goes beyond existing academic and commercial research in a number
of ways. There is hardly any readily available bilingual training
data for the languages considered here, unlike for Arabic or Chinese,
where sizable parallel corpora are distributed by the Linguistic Data
Consortium (LDC). This means that we have to acquire the necessary
training data ourselves.

To this end, we will use internet resources to learn translation
models. By exploiting online resources for machine translation, this
project will address a number of vital research issues:

  • How can multi-lingual resources be automatically identified and
    harvested?

  • How can translation rules be learned from smaller and only
    partially translated resources?

  • How do existing search strategies for finding the most likely
    translation have to be adapted to cope with limited resources?

  • How can one rapidly build evaluation benchmarks for languages
    with limited resources?


  • Simon Carter
  • Christof Monz