information and language processing systems
isla, university of amsterdam

WikiXML: XML format

Each page in the WikiXML collection (i.e., article, category, image or template) is provided as a separate XML document that, in most cases (see below), conforms to the WikiXML DTD. This DTD is an extension of the XHTML 1.0 Transitional DTD with elements and attributes specific to the WikipediaXML collection. XML elements and attributes that extend the XHTML format all belong to the wx namespace. Stripping these elements and attributes from the WikipediaXML files yields valid XHTML documents.
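The stripping step described above could be sketched roughly as follows, using Python's standard SAX filter classes. This is only an illustration under stated assumptions: the prefix is taken to be literally wx in the files, "stripping" an element is interpreted as removing its tags while keeping its content, and the namespace URI http://example.org/wx and the wx:note attribute in the sample are hypothetical placeholders, not taken from the WikiXML DTD.

```python
import io
import xml.sax
from xml.sax.saxutils import XMLFilterBase, XMLGenerator
from xml.sax.xmlreader import AttributesImpl

# Hypothetical sample; the namespace URI and the wx:note attribute are placeholders.
SAMPLE = (
    '<div xmlns:wx="http://example.org/wx">'
    '<wx:section level="2" title="History">'
    '<h2>History</h2>'
    '<p class="intro" wx:note="x">Text</p>'
    '</wx:section>'
    '</div>'
)


class WxStripper(XMLFilterBase):
    """Drop wx: element tags (keeping their content) and wx: attributes."""

    def startElement(self, name, attrs):
        if name.startswith("wx:"):
            return  # skip the tag itself; children are still passed through
        kept = {
            key: value
            for key, value in attrs.items()
            if not key.startswith("wx:") and key != "xmlns:wx"
        }
        super().startElement(name, AttributesImpl(kept))

    def endElement(self, name):
        if not name.startswith("wx:"):
            super().endElement(name)


def strip_wx(xml_text):
    """Return xml_text serialized again without wx elements and attributes."""
    out = io.StringIO()
    reader = WxStripper(xml.sax.make_parser())
    reader.setContentHandler(XMLGenerator(out, encoding="utf-8"))
    reader.parse(io.BytesIO(xml_text.encode("utf-8")))
    return out.getvalue()


if __name__ == "__main__":
    print(strip_wx(SAMPLE))
```

Real collection files may reference the DTD and use XHTML character entities, which this sketch does not attempt to resolve.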

XML files for a small number (about 1%) of pages do not conform to the WikiXML DTD, although all files are guaranteed to contain well-formed XML. Such non-valid pages are marked in the page table of the database distributed with the collection.

The XML format of the collection differs from Wikipedia's regular XHTML output in several respects. The sections below describe these differences.

Page content and meta-data

Sections

The XML converter marks sections of pages using <wx:section level="nnn" title="Header of the section"> (where 1 ≤ nnn ≤ 6). In addition, sections include the usual XHTML header elements <h1> to <h6>. A section ends at the next header of the same or a smaller level, or at the end of the containing element. Sections can be nested.
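As an illustration of working with this markup, the section outline of a page could be read off the wx:section elements along the following lines. This is a minimal sketch: the level and title attributes are as described above, but the namespace URI http://example.org/wx is a hypothetical placeholder (the real URI is whatever the collection files declare), and the sample fragment is invented.

```python
import xml.etree.ElementTree as ET

# Hypothetical placeholder; use the URI bound to the wx prefix in the real files.
WX_NS = "http://example.org/wx"

# Invented fragment showing nested sections with their usual XHTML headers.
SAMPLE = """<body xmlns:wx="http://example.org/wx">
<wx:section level="2" title="History">
<h2>History</h2>
<p>Some text.</p>
<wx:section level="3" title="Early years">
<h3>Early years</h3>
</wx:section>
</wx:section>
<wx:section level="2" title="Geography">
<h2>Geography</h2>
</wx:section>
</body>"""


def outline(root):
    """Return (level, title) pairs for all sections, in document order."""
    tag = f"{{{WX_NS}}}section"
    return [(int(sec.get("level")), sec.get("title")) for sec in root.iter(tag)]


if __name__ == "__main__":
    for level, title in outline(ET.fromstring(SAMPLE)):
        print("  " * (level - 1) + title)
```

Because nested sections are real child elements, a recursive walk over wx:section children would recover the tree structure directly instead of a flat list.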

Links

In general, the WikipediaXML collection uses XHTML <a...> elements to mark up all links in pages.

Templates

Templates (e.g., {{infobox | param | param}}) are expanded as usual. Additionally:

Interpreting Wikipedia's template argument values is hard in general, because a template can transform its arguments in many ways: namespaces can be added, links inserted or removed, etc. The converter therefore handles only easily identifiable uses of template arguments.

Marking the boundaries of template expansions with two empty elements, <wx:template .../> expanded template <wx:templateend .../>, is probably not the most elegant solution and may be inconvenient to work with. It is necessary, however, because expanded templates are not always balanced XML fragments and therefore cannot always be contained inside a single XML element such as <wx:template ...> expanded template </wx:template>.
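One way to cope with the paired marker elements is to walk the document in order and collect whatever text falls between a start marker and its end marker, even when the two sit in different parent elements. The sketch below does this with a stack. It assumes that a wx:templateend closes the most recently opened wx:template (so expansions nest), which the description above does not state; the namespace URI http://example.org/wx and the sample fragment are hypothetical, and marker attributes are left out because they are not specified here.

```python
import xml.etree.ElementTree as ET

# Hypothetical placeholder; use the URI bound to the wx prefix in the real files.
WX_NS = "http://example.org/wx"
START = f"{{{WX_NS}}}template"
END = f"{{{WX_NS}}}templateend"

# Invented fragment: the expansion starts in one paragraph and ends in the next,
# so it could not be wrapped in a single element.
SAMPLE = """<body xmlns:wx="http://example.org/wx">
<p>Intro <wx:template/>born 1900</p>
<p>died 1980<wx:templateend/> more text</p>
</body>"""


def template_spans(root):
    """Return the text of each template expansion, in order of completion."""
    spans = []   # finished expansions
    stack = []   # one text buffer per currently open expansion

    def add(text):
        if text:
            for buf in stack:
                buf.append(text)

    def walk(elem):
        if elem.tag == START:
            stack.append([])
        elif elem.tag == END and stack:
            spans.append("".join(stack.pop()))
        add(elem.text)
        for child in elem:
            walk(child)
            add(child.tail)  # text following the child belongs to the parent

    walk(root)
    return spans


if __name__ == "__main__":
    for span in template_spans(ET.fromstring(SAMPLE)):
        print(" ".join(span.split()))
```

Collecting only text keeps the sketch short; a fuller version might record the elements crossed by each span as well.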

Images

Images included in Wikipedia pages are marked up with XHTML <img> and <a> elements:

Miscellaneous

Examples of the XML format