Wikipedia is a terrific knowledge resource, and many recent studies
in artificial intelligence, information retrieval and related fields have used Wikipedia to endow
computers with (some) human knowledge. Wikipedia dumps
are publicly available in XML format, but they have a few shortcomings. First, they contain a lot of information
that is rarely needed when Wikipedia texts are used as a source of knowledge (e.g., the IDs of users who edited each article,
or the timestamps of article modifications). Second, the XML dumps omit a lot of useful information that
can be inferred from the dump, such as link tables, the category hierarchy, the resolution of redirection links, etc.
In my Ph.D. work, I developed a fairly extensive preprocessor that converts the
standard Wikipedia XML dump into my own extended format, eliminating some of this information and adding other useful information.
The Wikipedia preprocessor is a single Perl script, which can be downloaded here.
The software is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
If the input file is named XXX.xml, then the following files will be produced:
XXX.hgw.xml - the extended XML format (see description below)
XXX.log - a HUGE log file with a lot of information about the preprocessing progress.
Unless you're debugging the script, this file can almost always be safely deleted, as its size can easily reach tens of gigabytes.
In fact, you should probably disable most of the commands in the script that send information to the log file.
XXX.anchor_text.sorted - anchor text associated with each internal link between Wikipedia articles
XXX.cat_hier - the hierarchy of Wikipedia categories
XXX.related_links - lists of related articles identified using contextual hints ("Further information", "Related topic", "See also" etc.)
XXX.stat.categories - number of pages in each Wikipedia category
XXX.stat.inlinks - number of incoming links for each article
To give you an idea of what the output files look like, here you can download
a BZIP2 archive with the output files (except for the log file) produced for the November 11, 2005, snapshot of the English Wikipedia.
The original snapshot itself is available here. Of course, you should strive to use the
latest Wikipedia snapshot; unfortunately, I do not have enough storage (or bandwidth) to provide you with preprocessed versions of the latest
dumps. The script has been tested on the snapshot dated July 19, 2007, and produced about 9 GB of output files (not counting the log file of over 40 GB).
- Resolve all redirection links, that is, if B redirects to C and A links to B, then A should link
directly to C. After you resolve all redirections this way, you can build the entire link graph.
Another reason to resolve redirects is that you can then use anchor text as an additional source of knowledge
(see the first sketch after this list).
Further info on links & redirects:
- When handling links, note that Wikipedia has a convention that only the first letter of the article
title is case-insensitive (it is capitalized automatically), while the remaining letters are case-sensitive, so you need
to normalize link targets to be able to match them against article titles (see the normalization sketch after this list).
A similar story holds for dates, as the following variants both lead to the same article (via a redirect):
1) [[July 20]]
2) [[20 July]]
- Many Wikipedia pages include templates, and some of these templates further include nested templates.
Templates often contain valuable and relevant information, so it makes a lot of sense to try to
resolve them, at least to some nesting depth (I resolved them to depth 5; see the template-expansion sketch after this list).
The way I did it is to pre-scan the entire set of Wikipedia articles and parse all templates, then scan the articles
again and include the templates as needed. To this end, you'll need to understand how template
parameters are inserted. Here are a few relevant URLs:
Specifically, learn about the <noinclude> and <includeonly> tags.
- You'll probably want to remove stubs & other short articles (see the last sketch after this list):
- Wikipedia has an elaborate system of namespaces to which articles belong. I'd recommend that you
learn about the namespaces here:
- More on Wikipedia data dumps:
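
To make these tips more concrete, here are a few small Perl sketches. They are illustrations only, not code taken
from the preprocessor itself, and all names and data in them (hash names, toy titles, thresholds) are made up for the
examples. First, resolving redirect chains and rewriting a toy link table:

    # Minimal sketch: resolve redirect chains so that every link points to a real article.
    # %redirect would normally be filled while scanning the dump; the entries below are toy data.
    use strict;
    use warnings;

    my %redirect = (
        'UK'            => 'United Kingdom',
        'Great Britain' => 'United Kingdom',
    );

    # Follow a redirect chain to its final target, guarding against cycles.
    sub resolve_title {
        my ($title) = @_;
        my %seen;
        while (exists $redirect{$title} && !$seen{$title}++) {
            $title = $redirect{$title};
        }
        return $title;
    }

    # If A links to B and B redirects to C, store the link as A -> C.
    my %links = ('Article A' => ['UK', 'France']);
    for my $src (keys %links) {
        @{ $links{$src} } = map { resolve_title($_) } @{ $links{$src} };
    }
    print join(', ', @{ $links{'Article A'} }), "\n";  # United Kingdom, France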
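
Next, a sketch of link-target normalization under the convention described above (first letter case-insensitive,
underscores equivalent to spaces); the helper name normalize_title is hypothetical:

    # Capitalize the first letter, treat underscores as spaces, and collapse whitespace.
    use strict;
    use warnings;

    sub normalize_title {
        my ($title) = @_;
        $title =~ tr/_/ /;          # underscores and spaces are interchangeable
        $title =~ s/\s+/ /g;        # collapse runs of whitespace
        $title =~ s/^ //;           # trim leading space
        $title =~ s/ $//;           # trim trailing space
        return ucfirst($title);     # only the first letter is case-insensitive
    }

    print normalize_title('july 20'), "\n";          # July 20
    print normalize_title('united_Kingdom'), "\n";   # United Kingdom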
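
A sketch of depth-limited template expansion. The %template hash stands for the template bodies collected in the
first pass; the 'Birth year' template and the purely positional parameters ({{{1}}}, {{{2}}}, ...) are simplifying
assumptions, since real templates also use named parameters, default values, and parser functions:

    use strict;
    use warnings;

    my %template = (
        'Birth year' => 'born in {{{1}}}',   # toy template body, for illustration only
    );

    # Expand a single template call: substitute positional parameters into the body.
    sub expand_one {
        my ($name, $args) = @_;
        my @params = defined $args ? split(/\|/, $args) : ();
        my $body = $template{$name};
        return '' unless defined $body;      # drop calls to unknown templates
        $body =~ s!\{\{\{(\d+)\}\}\}!defined $params[$1 - 1] ? $params[$1 - 1] : ''!ge;
        return $body;
    }

    # Repeatedly expand the innermost (non-nested) template calls, up to a fixed depth.
    sub expand_templates {
        my ($text, $depth) = @_;
        for (1 .. $depth) {
            my $changed = $text =~ s!\{\{([^{}|]+)(?:\|([^{}]*))?\}\}!expand_one($1, $2)!ge;
            last unless $changed;
        }
        return $text;
    }

    print expand_templates('Einstein ({{Birth year|1879}})', 5), "\n";
    # prints: Einstein (born in 1879)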
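
Finally, one possible way to filter out stubs and other short articles; the 100-word threshold and the {{...-stub}}
template test are arbitrary illustrative choices, not the preprocessor's actual criteria:

    use strict;
    use warnings;

    sub is_stub {
        my ($text) = @_;
        my $words = () = $text =~ /\w+/g;                  # crude word count
        return 1 if $words < 100;                          # too short to be useful
        return 1 if $text =~ /\{\{[^{}]*stub[^{}]*\}\}/i;  # e.g. {{geo-stub}}
        return 0;
    }

    print is_stub('{{geo-stub}} A tiny village in Wales.') ? "skip\n" : "keep\n";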
Well, it's free, so don't expect too much support :) I will likely be able to answer simple questions,
but not complex programming questions (please refer to your local Perl guru). I do not promise to fix bugs,
but I will do my best, especially if you suggest a specific way to fix the bug you encountered
(in which case your contribution will, of course, be acknowledged).