Multi-dimensional Markup in Natural Language Processing is the theme of the fifth workshop on NLP and XML (NLPXML-2006).
- What is Multi-dimensional Markup?
- What problems are associated with combining markup?
- Would these problems not disappear if we just did not use XML/SGML?
- What solutions are there for markup combinations?
- Where can I find more information on combining markup?
1. What is Multi-dimensional Markup?
Multi-dimensional markup refers to any combination of document annotations. For example, a document may contain specifications of typographic entities (headings, paragraphs and lines) as well as annotations of linguistic entities (sentences, phrases and words). The combination of two or more such annotation sets is called multi-dimensional markup.
2. What problems are associated with combining markup?
The biggest problem associated with combining markup is overlap: one entity starts inside another but continues after that other entity has ended, as in <entity1> A <entity2> B </entity1> C </entity2>. Such a construction can easily occur when two independent annotation sets are combined, but it is not well-formed XML, which requires elements to nest properly.
A larger scale example of this problem can be found in Durusau & O'Donnell's XML Europe 2002 paper.
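The overlap problem can be reproduced with any conforming XML parser. The following is a minimal sketch in Python using the standard library's xml.etree.ElementTree; the wrapping <doc> element, the variable names and the helper is_well_formed are assumptions added for illustration and are not part of the workshop material.

```python
import xml.etree.ElementTree as ET

# The overlapping construction from the text, wrapped in a hypothetical
# <doc> root element so that only the overlap itself is at issue.
overlapping = "<doc><entity1> A <entity2> B </entity1> C </entity2></doc>"

# A properly nested variant: entity2 closes before entity1 does.
nested = "<doc><entity1> A <entity2> B </entity2></entity1> C </doc>"


def is_well_formed(text):
    """Return True if text parses as XML, False if the parser rejects it."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


if __name__ == "__main__":
    print("nested:", is_well_formed(nested))
    print("overlapping:", is_well_formed(overlapping))
```

The parser accepts the nested variant and rejects the overlapping one, which is why two independently produced annotation sets cannot simply be merged into a single XML tree.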
3. Would these problems not disappear if we just did not use XML/SGML?
This is indeed a consequence of the structure of XML/SGML, but abandoning these formats is not an option. XML is the current annotation standard: in addition to the XML annotations available for text corpora, a variety of XML processing software is being developed, all of which we want to be able to apply to our data.
4. What solutions are there for markup combinations?
The most common approach to combining different text annotations in the field of Natural Language Processing is standoff annotation, in which independent annotations pointing to positions in a text file are kept separate from the data. Well-known systems that use standoff markup are GATE (General Architecture for Text Engineering), Callisto, NXT (NITE XML Toolkit), MMAX, AGTK (Annotation Graph Toolkit) and ATLAS (Architecture and Tools for Linguistic Analysis Systems). Another solution is GODDAG, which represents annotations as graphs rather than trees.
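The idea behind standoff annotation can be sketched in a few lines of Python. This is an illustration only: the Span type, the layer names, the sample sentence and the crosses helper are all hypothetical, and the sketch does not reproduce the storage format of GATE, NXT or any other system named above. Each annotation records character offsets into an untouched text, so a typographic layer and a linguistic layer can coexist even where their entities overlap.

```python
from typing import NamedTuple


class Span(NamedTuple):
    """A standoff annotation: offsets into the text, stored apart from it."""
    layer: str
    label: str
    start: int  # inclusive character offset
    end: int    # exclusive character offset


# The primary data is never modified by the annotations.
text = "The cat sat on the mat."

# Hypothetical typographic layer: the sentence is broken over two lines.
line1 = Span("typographic", "line", 0, 11)
line2 = Span("typographic", "line", 12, 23)

# Hypothetical linguistic layer: a sentence and a verb phrase.
sentence = Span("linguistic", "sentence", 0, 23)
verb_phrase = Span("linguistic", "phrase", 8, 22)


def covered_text(span):
    """Return the stretch of text an annotation points to."""
    return text[span.start:span.end]


def crosses(a, b):
    """True if a and b overlap without one being nested in the other."""
    return a.start < b.start < a.end < b.end or b.start < a.start < b.end < a.end


if __name__ == "__main__":
    print(covered_text(verb_phrase))
    print(crosses(line1, verb_phrase))
```

Here the verb phrase crosses the line break, so it overlaps both lines; inline XML could not express this, whereas the standoff spans record it without conflict.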
5. Where can I find more information on combining markup?
The theoretical and practical problems of combining different markup levels have been discussed at various editions of the Extreme Markup conferences.
Another good source of information is the TEI Overlapping Markup SIG wiki.
The XML Hierarchies page on Cover Pages gives a timeline of events in hierarchical annotation, with many links to interesting information.