In addition to the XML-encoded content of Wikipedia pages, each language collection provides a number of database tables that simplify access to the data.
Table: pageThis table contains information about all pages in the corresponding WikiXML collection for the given language. | |
| page_id | id of the page, a number unique within one language collection |
| page_namespace | namespace (type) of the page: 'article', 'category', 'image' or 'template' |
| page_title | title of the page in UTF-8 encoding (without namespace) |
| page_wikilen | length in bytes of the page's wiki text |
| page_xmllen | length in bytes of the page's XML document |
| page_is_redirect | 1 if the page is a redirect to another page, otherwise 0 |
| page_is_valid | 1 if the XML of the page conforms to the WikiXML DTD, otherwise 0 |
Table: page_imageFor each page in the collection, this table lists images included in the page, that are actually present in the collection (as page from the image namespace).
| |
| page_id | id of the source page where image appears |
| image_id | id of the image page |
| anchor_id | id of the XML <a> element on the source page that provides a link to the page of the image |
Note that since galleries (annotated using <gallery> tag in wikitext, see here) are currently not processed by the converter, images in galleries will not be included in the page_image table, but rather end up as usual page links in the page_link table.
| |
Table: redirectThis table lists all Wikipedia redirects (e.g., article ECG is a redirect to article Electrocardiogram). | |
| page_id | id of the redirection page |
| target_id | id of the page it redirects to |
Table: page_langlinkThis table provides information about language (interwiki) links for pages. For each page it lists its counterparts in other languages. | |
| source_wiki | the language prefix of the page ("en" for English, "nl" for Dutch, "es" for Spanish etc.) |
| source_id | id of the page |
| target_name | is a prefixed name of the version of the page in another language (e.g., German version de:Ostasien for English page East Asia) |
| target_wiki target_id | language prefix and the id of the version in another language; these fields will not be set in the versions of the "page_language" table distributed with individual language Wikipedia XML collections: a complete table with all interwiki information should be downloaded and install separately as described in Downloading and installing WikiXML |
Table: xmlpageThis table provides XML documents, the result of the conversion of the Wikipedia pages (see XML format for WikiXML). The contents of this table are not included in the databases distributed with the collections, due to the large size. You need do download the corresponding XML collections and load them into the database as described in Downloading and installing WikiXML. | |
| page_id | id of the page |
| page_xml | the UTF-8 encoded XML content for the page |
Below you can find formal MySQL defintions of the tables described above.
--
-- This essentially mirrors the "page" table from the Wikipedia dump.
-- Of course it contains only pages that are included in the collection.
--
CREATE TABLE page (
-- Identical to page_id from Wikipedia dump
page_id int unsigned not null,
-- Page namespace
page_namespace enum ('article', 'category', 'image', 'template') not null,
-- Identical to page_title from Wikipedia dump
page_title varchar(255) binary not null,
-- Length of wiki text; identical to page_len from Wikipedia dump
page_wikilen int unsigned not null,
-- Length of XML document in bytes (not characters)
page_xmllen int unsigned not null,
-- Nonzero for redirect pages, zero otherwise
page_is_redirect tinyint(1) not null,
-- Nonzero if the XML document is valid, zero otherwise
page_is_valid tinyint(1) not null,
primary key (page_id),
unique index title (page_namespace, page_title),
);
--
-- Category index. Contains only membership entries that could be resolved
-- to an existing category; ie. "[[Category::non_existing_title]]" is ignored.
--
CREATE TABLE page_category (
-- Page id of member
page_id int unsigned not null,
-- Page id of category
category_id int unsigned not null,
index (page_id),
index (category_id),
);
--
-- Wiki links on the page, including unknown links (to non-existing pages)
-- and inline links to categories, but excluding category links, language links,
-- interwiki links and external links.
--
CREATE TABLE page_link (
-- Source page id
source_id int unsigned not null,
-- Target page id (for known links)
target_id int unsigned null,
-- Target page name; ie. "Easter"
target_name varchar(255) binary not null,
-- Target section name (from eg. [[Easter#Traditions]])
target_fragment varchar(255) binary null,
-- Id of <a> element
anchor_id varchar(255) binary not null,
-- Anchor text of link
anchor_text varchar(255) binary null,
primary key (page_id, link_id)
index (target_id)
);
--
-- Index of images that appear on each page. Only images that exist within the same
-- wiki are indexed (ie. images on commons are ignored).
--
CREATE TABLE page_image (
-- Page id
page_id int unsigned not null,
-- Image page id
image_id int unsigned not null,
-- Id attribute of anchor element
anchor_id varchar(255) binary not null,
primary key (page_id, anchor_id),
index (image_id)
);
--
-- Index of templates that are used on each page.
-- Contains only top-level templates (not the templates used inside templates).
-- Contains only templates that we could successfully mark and resolve
-- during the conversion.
--
CREATE TABLE page_template (
-- Page id
page_id int unsigned not null,
-- Template page id
template_id int unsigned not null,
-- Id of wx:template element
element_id varchar(10) binary not null,
primary key (page_id, template_id),
index (template_id)
);
--
-- Redirection index. Contains only redirection pages that could be
-- resolved to an existing page; ie. "#REDIRECT [[non_existing_title]]"
-- is ignored.
--
-- Like real Wikipedia, we don't follow repeated redirections. So the
-- target of a redirection may again be a redirection page.
--
CREATE TABLE redirect (
-- Page id of the redirection page
page_id int unsigned not null,
-- Page id of the redirection target
target_id int unsigned not null,
primary key (page_id),
index (target_id)
);
--
-- Language link table. After converting all languages, these tables
-- are merged into one global table.
--
-- Initially, only the target_name of the targets is known; we attempt
-- to resolve these names to (target_wiki, target_id) in a separate operation,
-- using the page entries of the target wiki.
--
CREATE TABLE page_langlink (
-- Source page key
source_wiki varchar(16) binary not null,
source_id int unsigned not null,
-- Target page key
target_wiki varchar(16) binary null,
target_id int unsigned null,
-- Target page name; ie. "nl:Categorie:Kerst"
target_name varchar(255) binary not null,
index source (source_wiki, source_id),
index target (target_wiki, target_id)
);
--
-- This table contains the XML conversion output.
-- It is filled by running a separate script: insertxml.py
--
CREATE TABLE xmlpage (
-- Page ID
page_id int unsigned not null,
-- XML document
page_xml mediumblob not null,
primary key (page_id)
);