Each page in the WikiXML collection (i.e., article, category, image or template) is provided as a separate XML document that, in most cases (see below), conforms to the WikiXML DTD. This DTD is an extension of the XHTML 1.0 Transitional DTD with elements and attributes specific to the WikipediaXML collection. XML elements and attributes extending the XHTML format all belong to the wx namespace. Stripping such elements and attributes from the files of WikipediaXML results in valid XHTML documents.
XML files for a small number (about 1%) of pages do not conform to the WikiXML DTD, although all files are guaranteed to contain well-formed XML. Such non-valid pages are marked in the page table of the database distributed with the collection.
The XML format of the collection differs from Wikipedia's regular XHTML output in several respects. The sections below describe the differences.
- <div id="wx_toc"></div>;
- [edit] links are removed;
- <title> element reflects prefixed page name (e.g., Category:Operas_by_Pietro_Mascagni);
- <meta> elements provide the namespace, page id, page name, and redirection target for redirect pages;
- <div id="wx_article"> element;
- id attributes to all XML elements on every page (with the exception of <wx:templateend> elements);
The XML converter marks sections of pages using <wx:section level="nnn" title="Header of the section"> (here 1 ≤ nnn ≤ 6). In addition, sections include the usual XHTML header elements <h1> to <h6>. A section ends at the next header of the same or a smaller level, or at the end of the containing element. Sections can be nested.
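As an illustration of the section markup, the following minimal sketch lists the sections of a page as an indented outline. The wx namespace URI used here is a placeholder assumption (the real URI must be read from the collection's files), and the sample document is invented for the example.

```python
import xml.etree.ElementTree as ET

# Assumption: placeholder URI. Replace it with the wx namespace URI
# actually declared in the collection's XML files.
WX_NS = "http://example.org/wx"
XHTML_NS = "http://www.w3.org/1999/xhtml"

# Invented sample page with one nested section.
SAMPLE = f"""<html xmlns="{XHTML_NS}" xmlns:wx="{WX_NS}">
<body>
<wx:section level="2" title="History">
<h2>History</h2>
<p>Text.</p>
<wx:section level="3" title="Early years">
<h3>Early years</h3>
</wx:section>
</wx:section>
</body>
</html>"""


def section_outline(root):
    """Return (level, title) pairs for every wx:section, in document order."""
    tag = f"{{{WX_NS}}}section"
    return [(int(el.get("level")), el.get("title")) for el in root.iter(tag)]


outline = section_outline(ET.fromstring(SAMPLE))
for level, title in outline:
    # Indent each title according to its section level.
    print("  " * (level - 1) + title)
```

Because nested sections are nested elements, the same information could also be recovered by walking the tree recursively; the flat iteration above is simply the shortest way to obtain an outline.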
In general, the WikipediaXML collection uses XHTML <a...> elements to mark up all links in pages.
- <div id="wx_categorylinks"> element at the end of pages;
- <div id="wx_languagelinks"> element;
- title attribute is removed from <a> element;
- wx:linktype="external" is added;
- [1], [2], etc.) are removed;
- wx:linktype="fragment" is added;
- wx:fragment contains the anchor name of the link target (name attribute of the target <a> element);
- wx:linktype="known" is added;
- wx:pagename contains the prefixed name of the target page;
- wx:page_id contains the page id of the target page, unless the target namespace is not in the scope of our collection;
- wx:fragment contains the anchor name of the target section;
- wx:linktype="unknown" is added;
- wx:pagename contains the prefixed name of the target page;
- wx:fragment contains the anchor name in the link target page, if provided;
- href attribute points to a standard /wiki/Page_Name URL instead of the edit script;
- wx:linktype="media";
- href attribute points to a standard /wiki/... URL instead of the target media item;
- wx:linktype="interwiki" is added;
- wx:pagename contains a qualified name of the target page, such as wiki:My_page or wiki:Category:My_category;
- wx:fragment contains the anchor name in the link target page, if provided;
- wx:linktype="self" is added;
- <ref> markup in wiki-text):
- wx:linktype="note" is added;
- wx:linktype="noteref" is added;
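The wx:linktype attribute makes it straightforward to separate links by kind. The sketch below groups the <a> elements of a page by their link type. The wx namespace URI is a placeholder assumption, and the sample links (including the page id) are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Assumption: placeholder URI. Replace it with the wx namespace URI
# actually declared in the collection's XML files.
WX_NS = "http://example.org/wx"
XHTML_NS = "http://www.w3.org/1999/xhtml"

# Invented sample; attribute values are made up for the example.
SAMPLE = f"""<html xmlns="{XHTML_NS}" xmlns:wx="{WX_NS}"><body>
<p><a href="/wiki/Berlin" wx:linktype="known" wx:pagename="Berlin" wx:page_id="1">Berlin</a>
<a href="/wiki/Germany" wx:linktype="known" wx:pagename="Germany" wx:page_id="2">Germany</a>
<a href="http://example.org/" wx:linktype="external">site</a>
<a href="/wiki/No_such_page" wx:linktype="unknown" wx:pagename="No_such_page">missing</a>
<a href="#History" wx:linktype="fragment" wx:fragment="History">History</a></p>
</body></html>"""


def links_by_type(root):
    """Group <a> elements by the value of their wx:linktype attribute."""
    groups = {}
    for a in root.iter(f"{{{XHTML_NS}}}a"):
        linktype = a.get(f"{{{WX_NS}}}linktype")
        if linktype is None:
            continue  # e.g. plain anchors that are not links
        groups.setdefault(linktype, []).append(a)
    return groups


groups = links_by_type(ET.fromstring(SAMPLE))
# Names of target pages that exist in the collection ("known" links).
known_targets = [a.get(f"{{{WX_NS}}}pagename") for a in groups.get("known", [])]
print(known_targets)
```

Filtering on wx:linktype="known" and reading wx:page_id in the same way would, for instance, give the raw material for a link graph restricted to pages inside the collection.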
Templates (e.g., {{infobox | param | param}}) are expanded as usual. Additionally:
- the start of a template expansion is marked with <wx:template id="xxx" pagename="Template:name" page_id="nnn"/> and the end is marked with <wx:templateend start="xxx"/> (here xxx is an element id);
- <wx:templatearguments for="xxx"> for each invocation of a template, containing zero or more elements <wx:argument name="name"> value </wx:argument> that specify the names and values of the template arguments as defined in the original wiki-text of the page;
- for example, the article Germany uses the template Infobox Country or territory and its argument capital is a link to the article Berlin;
- in general, interpreting Wikipedia's template argument values is very hard, because a template can transform its arguments in many ways: namespaces can be added, links inserted or removed, etc.; the converter only handles easily identifiable usages of template arguments.
Marking the boundaries of template expansions with two empty elements, <wx:template .../> expanded template <wx:templateend .../>, is probably not the nicest solution and may not be convenient to work with. It is necessary, however, because expanded templates are not always balanced XML fragments and therefore cannot be contained inside a single XML element such as <wx:template ...> expanded template </wx:template>.
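One way to work with the paired marker elements is to walk the page in document order and collect whatever appears between a <wx:template> and its matching <wx:templateend>. The sketch below does this and also reads the template arguments. It rests on several assumptions: the wx namespace URI is a placeholder, the sample page and its ids are invented, and the position of <wx:templatearguments> relative to the markers is a guess made only for the example.

```python
import xml.etree.ElementTree as ET

# Assumption: placeholder URI. Replace it with the wx namespace URI
# actually declared in the collection's XML files.
WX_NS = "http://example.org/wx"
XHTML_NS = "http://www.w3.org/1999/xhtml"
WX = f"{{{WX_NS}}}"

# Invented sample page; ids and the placement of the arguments are assumptions.
SAMPLE = f"""<html xmlns="{XHTML_NS}" xmlns:wx="{WX_NS}"><body>
<p id="e1">Intro</p>
<wx:template id="t1" pagename="Template:Infobox" page_id="42"/>
<wx:templatearguments for="t1"><wx:argument name="capital">Berlin</wx:argument></wx:templatearguments>
<table id="e2"><tr id="e3"><td id="e4">Berlin</td></tr></table>
<wx:templateend start="t1"/>
<p id="e5">After</p>
</body></html>"""


def template_spans(root):
    """Map template id -> (pagename, XHTML elements that start inside the expansion)."""
    open_spans, finished = {}, {}
    for el in root.iter():  # pre-order traversal, i.e. document order
        if el.tag == WX + "template":
            open_spans[el.get("id")] = (el.get("pagename"), [])
        elif el.tag == WX + "templateend":
            span = open_spans.pop(el.get("start"), None)
            if span is not None:
                finished[el.get("start")] = span
        elif el.tag.startswith(f"{{{XHTML_NS}}}"):
            # Record the element for every expansion that is currently open.
            for _, members in open_spans.values():
                members.append(el)
    return finished


def template_arguments(root):
    """Map template id -> {argument name: argument text}."""
    result = {}
    for args in root.iter(WX + "templatearguments"):
        result[args.get("for")] = {
            arg.get("name"): "".join(arg.itertext()).strip()
            for arg in args.iter(WX + "argument")
        }
    return result


root = ET.fromstring(SAMPLE)
spans = template_spans(root)
arguments = template_arguments(root)
print(spans["t1"][0], [el.get("id") for el in spans["t1"][1]], arguments["t1"])
```

The traversal relies on document order rather than on tree containment, so it still works when an expansion is not a balanced fragment. In that case an element may start inside the expansion and end outside it; the sketch simply records it as belonging to the expansion.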
Images included in Wikipedia pages are marked up with XHTML <img> and <a> elements:
- <a> element gets extra attributes: wx:linktype="image", wx:pagename, wx:page_id (the name and id of the corresponding image page, if included in the collection);
- src attribute of the <img> element is set to /wiki/Image:...;
- <div> element together with the optional image caption;
- <math> markup (used in wiki-text to display formulas) is passed to the XML output unmodified;
- <timeline> scripts are passed to the XML output unmodified;
- <hiero> markup (see WikiHiero) is passed to the XML output unmodified;
- page_category table in the database;
- {{FULLPAGENAME}}) are expanded as in MediaWiki's XHTML output, and moreover: <wx:variable name="FULLPAGENAME"> expanded text </wx:variable>
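Since expansions such as {{FULLPAGENAME}} are wrapped in <wx:variable> elements, they can be located after the fact. The following minimal sketch collects them from a page; the wx namespace URI is a placeholder assumption and the sample page is invented.

```python
import xml.etree.ElementTree as ET

# Assumption: placeholder URI. Replace it with the wx namespace URI
# actually declared in the collection's XML files.
WX_NS = "http://example.org/wx"
XHTML_NS = "http://www.w3.org/1999/xhtml"

# Invented sample page.
SAMPLE = f"""<html xmlns="{XHTML_NS}" xmlns:wx="{WX_NS}"><body>
<p>This page is <wx:variable name="FULLPAGENAME">Category:Operas by Pietro Mascagni</wx:variable>.</p>
</body></html>"""


def expanded_variables(root):
    """Return (name, expanded text) pairs for every wx:variable, in document order."""
    return [
        (v.get("name"), "".join(v.itertext()).strip())
        for v in root.iter(f"{{{WX_NS}}}variable")
    ]


variables = expanded_variables(ET.fromstring(SAMPLE))
print(variables)
```

A list of pairs is returned instead of a dictionary because the same name may be expanded more than once on a page.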