Wikipedia is a terrific knowledge resource, and many recent studies
in artificial intelligence, information retrieval, and related fields have used Wikipedia to endow
computers with (some) human knowledge. Wikipedia dumps
are publicly available in XML format, but they have a few shortcomings. First, they contain a lot of information
that is rarely needed when Wikipedia text is used as a knowledge source (e.g., the IDs of users who edited each article,
or the timestamps of article modifications). Second, the dumps omit a lot of useful information that
could be inferred from them, such as link tables, the category hierarchy, the resolution of redirection links, etc.
In the course of my Ph.D. work, I developed a fairly extensive preprocessor that converts the
standard Wikipedia XML dump into my own extended XML format, stripping some of the unneeded information and adding useful derived information.
If the input file is named XXX.xml, then the following files will be produced:
XXX.hgw.xml - the extended XML format (see description below)
XXX.log - a HUGE log file with a lot of information about the preprocessing progress.
Unless you're debugging the script, this file can almost always be safely deleted, as its size can easily reach tens of gigabytes.
In fact, you should probably disable most of the commands in the script that send information to the log file.
XXX.anchor_text.sorted - anchor text associated with each internal link between Wikipedia articles
XXX.cat_hier - the hierarchy of Wikipedia categories
XXX.related_links - lists of related articles identified using contextual hints ("Further information", "Related topic", "See also" etc.)
XXX.stat.categories - number of pages in each Wikipedia category
XXX.stat.inlinks - number of incoming links for each article
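To give a feel for how these derived files can be used, here is a small Perl sketch (not part of WikiPrep) that computes per-article incoming-link counts, which is essentially the kind of data stored in XXX.stat.inlinks. It reads resolved (source, target) link pairs from standard input; the tab-separated input format is an assumption made purely for illustration:

  use strict;
  use warnings;

  # Count incoming links per article from resolved (source, target) link pairs,
  # one tab-separated pair per line on standard input.
  my %inlinks;
  while (my $line = <STDIN>) {
      chomp $line;
      my ($source, $target) = split /\t/, $line;
      next unless defined $target;
      $inlinks{$target}++;
  }

  # Print the most heavily linked articles first.
  for my $title (sort { $inlinks{$b} <=> $inlinks{$a} } keys %inlinks) {
      print "$inlinks{$title}\t$title\n";
  }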
wikipedia-051105-preprocessed.tar.bz2 (700+ Mb) - to give you an idea of what the output files look like,
this BZIP2 archive contains a set of output files (except for the log file) produced for
the November 11, 2005, snapshot of the English Wikipedia.
sample.hgw.xml (1 Mb) - a small sample file with a few Wikipedia articles preprocessed by WikiPrep.
To see examples of all the other goodies produced by WikiPrep (link structure, category hierarchy, anchor text etc.), you'll need to download
the full example here.
Note: Of course, you should always strive to use the latest Wikipedia snapshot; unfortunately, I do not have enough storage (or bandwidth)
to provide preprocessed versions of the latest dumps. The script has been tested on the snapshot dated July 19, 2007, and produced about 9 Gb worth of output files
(not counting the log file, which weighs in at 40+ Gb).
The preprocessor script accomplishes the following tasks:
Many Wikipedia pages include templates, and some of these further include nested templates.
Templates often contain valuable and relevant information, so it makes a lot of sense to
resolve them, at least to some nesting depth (the script resolves templates to depth 5).
To this end, the script first pre-scans the entire set of Wikipedia articles and parses all the templates,
then scans the articles again and embeds the templates as needed. In particular, the script correctly processes
<noinclude> and <includeonly> tags. Caveat: templates often get messy and hard to parse; therefore, you will occasionally find artefacts
in the preprocessed article text, such as the sequence "]]}}". In the vast majority of cases, ignoring such
artefacts is straightforward, whereas developing a fool-proof preprocessor would be a very expensive
enterprise (and probably overkill).
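To make the two-pass scheme concrete, here is a minimal Perl sketch (not WikiPrep's actual code) of bounded template expansion; it assumes the first pass has already filled a hash that maps template names to their raw bodies:

  use strict;
  use warnings;

  my $MAX_DEPTH = 5;    # WikiPrep resolves templates down to nesting depth 5

  # Expand templates in $text to a bounded depth. $templates is a hash reference
  # mapping template names to their raw bodies, collected in the first pass.
  sub expand_templates {
      my ($text, $templates) = @_;
      for (1 .. $MAX_DEPTH) {
          # Replace innermost {{Name|params}} occurrences (no nested braces inside);
          # stop as soon as nothing is left to expand.
          last unless $text =~ s/\{\{\s*([^{}|]+?)\s*(?:\|[^{}]*)?\}\}/template_body($templates, $1)/ge;
      }
      return $text;
  }

  # Fetch a template body and apply the transclusion rules:
  # <noinclude> content is dropped, <includeonly> tags are stripped but their content is kept.
  sub template_body {
      my ($templates, $name) = @_;
      my $body = exists $templates->{$name} ? $templates->{$name} : '';
      $body =~ s/<noinclude>.*?<\/noinclude>//gs;
      $body =~ s/<\/?includeonly>//g;
      return $body;
  }

Bounding the expansion depth keeps overly deep or mutually recursive templates from blowing up the second pass.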
The script also resolves all redirection links; that is, if B redirects to C and A links to B, then A is made to link directly to C.
Once all redirections are resolved this way, one can easily build the entire link graph.
Another benefit of resolving redirects is that one can then easily collect all the anchor text for each article and use it
as an additional source of knowledge.
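A minimal sketch of the redirect-resolution step (again, not the actual WikiPrep code), assuming a hash that maps every redirecting title to its immediate target:

  # Follow chains of redirects (B -> C -> D ...), with a guard against cycles.
  sub resolve_redirect {
      my ($title, $redirect) = @_;
      my %seen;
      while (exists $redirect->{$title} && !$seen{$title}++) {
          $title = $redirect->{$title};
      }
      return $title;
  }

  # Example: if %redirect maps 'B' to 'C', then resolve_redirect('B', \%redirect)
  # returns 'C', so a link A -> B is stored as A -> C, and the anchor text of
  # [[B]] is credited to article C.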
Wikipedia has a convention that only the first letter of an article title is case-insensitive (the rest of the title is case-sensitive),
so the script normalizes titles to be able to match identical ones. Dates also have to be normalized; for example, the following variants
would all lead to the same article (a normalization sketch follows the list):
[[January 1]], [[2000]]
[[1 January]] [[2000]]
[[2000]]-[[01-01]]
[[2000-01-01]]
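The following Perl sketch (not WikiPrep's actual code) illustrates the kind of normalization involved; capitalizing the first letter reflects the convention above, and the ISO-style date form is an illustrative choice rather than necessarily the canonical form WikiPrep uses:

  use strict;
  use warnings;

  # Normalize a link target: collapse whitespace and capitalize the first letter,
  # since the first letter of a title is not case-sensitive.
  sub normalize_title {
      my ($title) = @_;
      $title =~ s/_/ /g;             # underscores and spaces are interchangeable in links
      $title =~ s/\s+/ /g;
      $title =~ s/^\s+|\s+$//g;
      return ucfirst($title);
  }

  # Rewrite the first two date variants into the ISO-style form of the last one,
  # "[[2000-01-01]]" (the "[[2000]]-[[01-01]]" variant is left out of this sketch).
  my %month = ( January => '01', February => '02', March => '03', April => '04',
                May => '05', June => '06', July => '07', August => '08',
                September => '09', October => '10', November => '11', December => '12' );
  my $month_re = join '|', keys %month;

  sub normalize_date_links {
      my ($text) = @_;
      $text =~ s/\[\[($month_re) (\d{1,2})\]\],? \[\[(\d{4})\]\]/sprintf('[[%04d-%s-%02d]]', $3, $month{$1}, $2)/ge;
      $text =~ s/\[\[(\d{1,2}) ($month_re)\]\] \[\[(\d{4})\]\]/sprintf('[[%04d-%s-%02d]]', $3, $month{$2}, $1)/ge;
      return $text;
  }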
For each article, the following information is provided in addition to its text:
Number of bytes in the article text before and after preprocessing
Whether it is a stub
Number of categories, outgoing links, and URLs
List of categories (including those inherited from templates)
List of outgoing links
List of URLs
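As an illustration of how the per-article records can be consumed, the sketch below streams over the preprocessed dump with the XML::Twig CPAN module. The element names ('page', 'title', 'categories', 'links') are only guesses for the sake of the example; consult sample.hgw.xml for the names WikiPrep actually emits:

  use strict;
  use warnings;
  use XML::Twig;

  my $file = shift @ARGV;    # e.g. XXX.hgw.xml

  XML::Twig->new(
      twig_handlers => {
          page => sub {
              my ($twig, $page) = @_;
              my $title = $page->first_child_text('title');
              my $cats  = $page->first_child_text('categories');
              my $links = $page->first_child_text('links');
              print "$title\t$cats\t$links\n";
              $twig->purge;    # free memory; the dump is far too large to keep in RAM
          },
      },
  )->parsefile($file);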
See the Usage section above for the list of additional files produced by WikiPrep.
This software is distributed under the terms of GNU General Public License version 2.
The software is provided on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
If you publish results based on this code, please cite the following papers:
Well, it's free, so don't expect too much support :) I will likely be able to answer simple questions,
but not complex programming questions (please refer to your local Perl guru). I do not promise to fix bugs,
but I will do my best, especially if you suggest a specific way to fix the bug you encountered
(in which case your contribution will, of course, be acknowledged).