Wikipedia is a terrific knowledge resource, and many recent studies
in artificial intelligence, information retrieval, and related fields have used Wikipedia to endow
computers with (some) human knowledge. Wikipedia dumps
are publicly available in XML format, but they have a few shortcomings. First, they contain a lot of information
that is rarely needed when Wikipedia text is used as a knowledge source (e.g., the IDs of users who edited each article,
or the timestamps of article modifications). Second, the dumps omit a lot of useful information that
could be inferred from them, such as link tables, the category hierarchy, the resolution of redirection links, etc.
In the course of my Ph.D. work, I developed a fairly extensive preprocessor that converts the
standard Wikipedia XML dump into my own extended XML format, stripping some of the unneeded information and adding useful derived information.
If the input file is named XXX.xml, then the following files will be produced:
XXX.hgw.xml - the extended XML format (see description below)
XXX.log - a HUGE log file with a lot of information about the preprocessing progress.
Unless you're debugging the script, this file can almost always be safely deleted, as its size can easily reach tens of gigabytes.
In fact, you should probably disable most of the commands in the script that send information to the log file.
XXX.anchor_text.sorted - anchor text associated with each internal link between Wikipedia articles
XXX.cat_hier - the hierarchy of Wikipedia categories
XXX.related_links - lists of related articles identified using contextual hints ("Further information", "Related topic", "See also" etc.)
XXX.stat.categories - number of pages in each Wikipedia category
XXX.stat.inlinks - number of incoming links for each article
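To give a feel for how these derived files can be used, here is a small Perl sketch (not part of WikiPrep) that computes per-article incoming-link counts, which is essentially the kind of data stored in XXX.stat.inlinks. It reads resolved (source, target) link pairs from standard input; the tab-separated input format is an assumption made purely for illustration:

  use strict;
  use warnings;

  # Count incoming links per article from resolved (source, target) link pairs,
  # one tab-separated pair per line on standard input.
  my %inlinks;
  while (my $line = <STDIN>) {
      chomp $line;
      my ($source, $target) = split /\t/, $line;
      next unless defined $target;
      $inlinks{$target}++;
  }

  # Print the most heavily linked articles first.
  for my $title (sort { $inlinks{$b} <=> $inlinks{$a} } keys %inlinks) {
      print "$inlinks{$title}\t$title\n";
  }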
wikipedia-051105-preprocessed.tar.bz2 (700+ Mb) - to give you an idea of what the output files look like,
this BZIP2 archive contains a set of output files (except for the log file) produced for
the November 11, 2005, snapshot of the English Wikipedia.
sample.hgw.xml (1 Mb) - a small sample file with a few Wikipedia articles preprocessed by WikiPrep.
To see examples of all the other goodies produced by WikiPrep (link structure, category hierarchy, anchor text etc.), you'll need to download
the full example here.
Note: Of course, you should always strive to use the latest Wikipedia snapshot; unfortunately, I do not have enough storage (or bandwidth)
to provide preprocessed versions of the latest dumps. The script has been tested on the snapshot dated July 19, 2007, and produced about 9 Gb worth of output files
(not counting the log file, which weighs in at 40+ Gb).
The preprocessor script accomplishes the following tasks:
Many Wikipedia pages include templates, and some of these further include nested templates.
Templates often contain valuable and relevant information, so it makes a lot of sense to
resolve them, at least to some nesting depth (the script resolves templates to depth 5).
To this end, the script first pre-scans the entire set of Wikipedia articles and parses all the templates,
then scans the articles again and embeds the templates as needed. In particular, the script correctly processes
<noinclude> and <includeonly> tags. Caveat: templates often get messy and hard to parse; therefore, you will occasionally find artefacts
in the preprocessed article text, such as the sequence "]]}}". In the vast majority of cases, ignoring such
artefacts is straightforward, whereas developing a fool-proof preprocessor would be a very expensive
enterprise (and probably overkill).
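To make the two-pass scheme concrete, here is a minimal Perl sketch (not WikiPrep's actual code) of bounded template expansion; it assumes the first pass has already filled a hash that maps template names to their raw bodies:

  use strict;
  use warnings;

  my $MAX_DEPTH = 5;    # WikiPrep resolves templates down to nesting depth 5

  # Expand templates in $text to a bounded depth. $templates is a hash reference
  # mapping template names to their raw bodies, collected in the first pass.
  sub expand_templates {
      my ($text, $templates) = @_;
      for (1 .. $MAX_DEPTH) {
          # Replace innermost {{Name|params}} occurrences (no nested braces inside);
          # stop as soon as nothing is left to expand.
          last unless $text =~ s/\{\{\s*([^{}|]+?)\s*(?:\|[^{}]*)?\}\}/template_body($templates, $1)/ge;
      }
      return $text;
  }

  # Fetch a template body and apply the transclusion rules:
  # <noinclude> content is dropped, <includeonly> tags are stripped but their content is kept.
  sub template_body {
      my ($templates, $name) = @_;
      my $body = exists $templates->{$name} ? $templates->{$name} : '';
      $body =~ s/<noinclude>.*?<\/noinclude>//gs;
      $body =~ s/<\/?includeonly>//g;
      return $body;
  }

Bounding the expansion depth keeps overly deep or mutually recursive templates from blowing up the second pass.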
The script also resolves all redirection links; that is, if B redirects to C and A links to B, then A is made to link directly to C.
Once all redirections are resolved this way, one can easily build the entire link graph.
Another benefit of resolving redirects is that one can then easily collect all the anchor text for each article and use it
as an additional source of knowledge.
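A minimal sketch of the redirect-resolution step (again, not the actual WikiPrep code), assuming a hash that maps every redirecting title to its immediate target:

  # Follow chains of redirects (B -> C -> D ...), with a guard against cycles.
  sub resolve_redirect {
      my ($title, $redirect) = @_;
      my %seen;
      while (exists $redirect->{$title} && !$seen{$title}++) {
          $title = $redirect->{$title};
      }
      return $title;
  }

  # Example: if %redirect maps 'B' to 'C', then resolve_redirect('B', \%redirect)
  # returns 'C', so a link A -> B is stored as A -> C, and the anchor text of
  # [[B]] is credited to article C.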
Wikipedia has a convention that only the first letter of an article title is case-insensitive (the rest of the title is case-sensitive),
so the script normalizes titles to be able to match identical ones. Dates also have to be normalized; for example, the following variants
would all lead to the same article (a normalization sketch follows the list):
[[January 1]], [[2000]]
[[1 January]] [[2000]]
[[2000]]-[[01-01]]
[[2000-01-01]]
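The following Perl sketch (not WikiPrep's actual code) illustrates the kind of normalization involved; capitalizing the first letter reflects the convention above, and the ISO-style date form is an illustrative choice rather than necessarily the canonical form WikiPrep uses:

  use strict;
  use warnings;

  # Normalize a link target: collapse whitespace and capitalize the first letter,
  # since the first letter of a title is not case-sensitive.
  sub normalize_title {
      my ($title) = @_;
      $title =~ s/_/ /g;             # underscores and spaces are interchangeable in links
      $title =~ s/\s+/ /g;
      $title =~ s/^\s+|\s+$//g;
      return ucfirst($title);
  }

  # Rewrite the first two date variants into the ISO-style form of the last one,
  # "[[2000-01-01]]" (the "[[2000]]-[[01-01]]" variant is left out of this sketch).
  my %month = ( January => '01', February => '02', March => '03', April => '04',
                May => '05', June => '06', July => '07', August => '08',
                September => '09', October => '10', November => '11', December => '12' );
  my $month_re = join '|', keys %month;

  sub normalize_date_links {
      my ($text) = @_;
      $text =~ s/\[\[($month_re) (\d{1,2})\]\],? \[\[(\d{4})\]\]/sprintf('[[%04d-%s-%02d]]', $3, $month{$1}, $2)/ge;
      $text =~ s/\[\[(\d{1,2}) ($month_re)\]\] \[\[(\d{4})\]\]/sprintf('[[%04d-%s-%02d]]', $3, $month{$2}, $1)/ge;
      return $text;
  }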
For each article, the following information is provided in addition to its text:
Number of bytes in the article text before and after preprocessing
Whether it is a stub
Number of categories, outgoing links, and URLs
List of categories (including those inherited from templates)
List of outgoing links
List of URLs
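As an illustration of how the per-article records can be consumed, the sketch below streams over the preprocessed dump with the XML::Twig CPAN module. The element names ('page', 'title', 'categories', 'links') are only guesses for the sake of the example; consult sample.hgw.xml for the names WikiPrep actually emits:

  use strict;
  use warnings;
  use XML::Twig;

  my $file = shift @ARGV;    # e.g. XXX.hgw.xml

  XML::Twig->new(
      twig_handlers => {
          page => sub {
              my ($twig, $page) = @_;
              my $title = $page->first_child_text('title');
              my $cats  = $page->first_child_text('categories');
              my $links = $page->first_child_text('links');
              print "$title\t$cats\t$links\n";
              $twig->purge;    # free memory; the dump is far too large to keep in RAM
          },
      },
  )->parsefile($file);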
See the Usage section above for the list of additional files produced by WikiPrep.
This software is distributed under the terms of GNU General Public License version 2.
The software is provided on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
If you publish results based on this code, please cite the following papers:
Well, it's free, so don't expect too much support :) I will likely be able to answer simple questions,
but not complex programming questions (please refer to your local Perl guru). I do not promise to fix bugs,
but I will do my best, especially if you suggest a specific way to fix the bug you encountered
(in which case your contribution will, of course, be acknowledged).