information and language processing systems
isla, university of amsterdam

WikiXML: Database structure

In addition to the XML-encoded content of Wikipedia pages, each language collection provides a number of database tables that simplify access to the data.

Table summary

Table: page
This table contains information about all pages in the corresponding WikiXML collection for the given language.
page_idid of the page, a number unique within one language collection
page_namespacenamespace (type) of the page: 'article', 'category', 'image' or 'template'
page_titletitle of the page in UTF-8 encoding (without namespace)
page_wikilenlength in bytes of the page's wiki text
page_xmllenlength in bytes of the page's XML document
page_is_redirect1 if the page is a redirect to another page, otherwise 0
page_is_valid1 if the XML of the page conforms to the WikiXML DTD, otherwise 0

Table "page_category"
This table includes information about categories of particles and sub/super-categories of categories.
page_idid of a page, which is either an article or a category page
category_idid of a page which is a category of the article page_id or a super-category of the category page_id

Table: page_link
This table lists all links between Wikipedia pages in the collection. It only contains links extracted from the content of pages, i.e., no links to categories of pages.
source_idid of the source page of the link
target_idid of the target page of the link, in case the target is part of the (monolingual) collection
target_namename of the target page, as provided in the original wiki text
target_fragmentif provided, the name of an anchor on the target page, typically, section name (e.g., from wikitext link Easter#Traditions)
anchor_nameid of the <a> element that marks the link on the source page
anchor_textanchor text of this <a> element
Note that target_id might be undefined (NULL) if the target page of a link does not exist (e.g., if the article has not been created yet, but is already linked to).

Table: page_image
For each page in the collection, this table lists images included in the page, that are actually present in the collection (as page from the image namespace).
page_idid of the source page where image appears
image_idid of the image page
anchor_idid of the XML <a> element on the source page that provides a link to the page of the image
Note that since galleries (annotated using <gallery> tag in wikitext, see here) are currently not processed by the converter, images in galleries will not be included in the page_image table, but rather end up as usual page links in the page_link table.

Table: page_template
This table lists templates used on the pages of the collection. It contains only top-level templates (i.e., no nested templates) and only those templates that were successfully identified by the converter.
page_idid of the page where a template is used
template_idid of the page of the template (remember that templates are also pages in the collection)
element_idid of the corresponding XML <wx_template element on the page where the template is used

Table: redirect
This table lists all Wikipedia redirects (e.g., article ECG is a redirect to article Electrocardiogram).
page_idid of the redirection page
target_idid of the page it redirects to

Table: page_langlink
This table provides information about language (interwiki) links for pages. For each page it lists its counterparts in other languages.
source_wikithe language prefix of the page ("en" for English, "nl" for Dutch, "es" for Spanish etc.)
source_idid of the page
target_nameis a prefixed name of the version of the page in another language (e.g., German version de:Ostasien for English page East Asia)
target_wiki
target_id
language prefix and the id of the version in another language; these fields will not be set in the versions of the "page_language" table distributed with individual language Wikipedia XML collections: a complete table with all interwiki information should be downloaded and install separately as described in Downloading and installing WikiXML

Table: xmlpage
This table provides XML documents, the result of the conversion of the Wikipedia pages (see XML format for WikiXML). The contents of this table are not included in the databases distributed with the collections, due to the large size. You need do download the corresponding XML collections and load them into the database as described in Downloading and installing WikiXML.
page_idid of the page
page_xmlthe UTF-8 encoded XML content for the page

MySQL table definitions

Below you can find formal MySQL defintions of the tables described above.

--
-- This essentially mirrors the "page" table from the Wikipedia dump.
-- Of course it contains only pages that are included in the collection.
--
CREATE TABLE page (

 -- Identical to page_id from Wikipedia dump
 page_id            int unsigned not null,

 -- Page namespace
 page_namespace     enum ('article', 'category', 'image', 'template') not null,

 -- Identical to page_title from Wikipedia dump
 page_title         varchar(255) binary not null,

 -- Length of wiki text; identical to page_len from Wikipedia dump
 page_wikilen       int unsigned not null,

 -- Length of XML document in bytes (not characters)
 page_xmllen        int unsigned not null,

 -- Nonzero for redirect pages, zero otherwise
 page_is_redirect   tinyint(1) not null,

 -- Nonzero if the XML document is valid, zero otherwise
 page_is_valid      tinyint(1) not null,

 primary key (page_id),
 unique index title (page_namespace, page_title),
);


--
-- Category index. Contains only membership entries that could be resolved
-- to an existing category; ie. "[[Category::non_existing_title]]" is ignored.
--
CREATE TABLE page_category (

 -- Page id of member
 page_id            int unsigned not null,

 -- Page id of category
 category_id        int unsigned not null,

 index (page_id),
 index (category_id),
);


--
-- Wiki links on the page, including unknown links (to non-existing pages)
-- and inline links to categories, but excluding category links, language links,
-- interwiki links and external links.
--
CREATE TABLE page_link (

 -- Source page id
 source_id            int unsigned not null,

 -- Target page id (for known links)
 target_id          int unsigned null,

 -- Target page name; ie. "Easter"
 target_name        varchar(255) binary not null,

 -- Target section name (from eg. [[Easter#Traditions]])
 target_fragment    varchar(255) binary null,

 -- Id of <a> element
 anchor_id          varchar(255) binary not null,

 -- Anchor text of link
 anchor_text        varchar(255) binary null,

 primary key (page_id, link_id)
 index (target_id)
);


--
-- Index of images that appear on each page. Only images that exist within the same
-- wiki are indexed (ie. images on commons are ignored).
--
CREATE TABLE page_image (

 -- Page id
 page_id            int unsigned not null,

 -- Image page id
 image_id           int unsigned not null,

 -- Id attribute of anchor element
 anchor_id         varchar(255) binary not null,

 primary key (page_id, anchor_id),
 index (image_id)
);


--
-- Index of templates that are used on each page.
-- Contains only top-level templates (not the templates used inside templates).
-- Contains only templates that we could successfully mark and resolve
-- during the conversion.
--
CREATE TABLE page_template (

 -- Page id
 page_id            int unsigned not null,

 -- Template page id
 template_id        int unsigned not null,

 -- Id of wx:template element
 element_id         varchar(10) binary not null,

 primary key (page_id, template_id),
 index (template_id)
);


--
-- Redirection index. Contains only redirection pages that could be
-- resolved to an existing page; ie. "#REDIRECT [[non_existing_title]]"
-- is ignored.
--
-- Like real Wikipedia, we don't follow repeated redirections. So the
-- target of a redirection may again be a redirection page.
--
CREATE TABLE redirect (

 -- Page id of the redirection page
 page_id            int unsigned not null,

 -- Page id of the redirection target
 target_id        int unsigned not null,

 primary key (page_id),
 index (target_id)
);


--
-- Language link table. After converting all languages, these tables
-- are merged into one global table.
--
-- Initially, only the target_name of the targets is known; we attempt
-- to resolve these names to (target_wiki, target_id) in a separate operation,
-- using the page entries of the target wiki.
--
CREATE TABLE page_langlink (

 -- Source page key
 source_wiki          varchar(16) binary not null,
 source_id            int unsigned not null,

 -- Target page key
 target_wiki        varchar(16) binary null,
 target_id          int unsigned null,

 -- Target page name; ie. "nl:Categorie:Kerst"
 target_name        varchar(255) binary not null,

 index source (source_wiki, source_id),
 index target (target_wiki, target_id)
);


--
-- This table contains the XML conversion output.
-- It is filled by running a separate script: insertxml.py
--
CREATE TABLE xmlpage (

  -- Page ID
  page_id            int unsigned not null,

  -- XML document
  page_xml           mediumblob not null,

  primary key (page_id)
);