Wikipedia is a terrific knowledge resource, and many recent studies
in artificial intelligence, information retrieval and related fields have used Wikipedia to endow
computers with (some) human knowledge. Wikipedia dumps
are publicly available in XML format, but they have a few shortcomings. First, they contain a lot of information
that is rarely needed when Wikipedia texts are used as a source of knowledge (e.g., the IDs of users who edited each article,
or the timestamps of article modifications). Second, the XML dumps omit a lot of useful information that
can be inferred from the dump, such as link tables, the category hierarchy, the resolution of redirection links, etc.
In my Ph.D. work, I developed a fairly extensive preprocessor that converts the
standard Wikipedia XML dump into my own extended format, eliminating some of this information and adding other useful information.
The Wikipedia preprocessor is a single Perl script, which can be downloaded here.
The software is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
If the input file is named XXX.xml, then the following files will be produced:
XXX.hgw.xml - the extended XML format (see description below)
XXX.log - a HUGE log file with a lot of information about the preprocessing progress.
Unless you're debugging the script, this file can almost always be safely deleted, as its size can easily reach tens of gigabytes.
In fact, you should probably disable most of the commands in the script that send information to the log file.
XXX.anchor_text.sorted - anchor text associated with each internal link between Wikipedia articles
XXX.cat_hier - the hierarchy of Wikipedia categories
XXX.related_links - lists of related articles identified using contextual hints ("Further information", "Related topic", "See also" etc.)
XXX.stat.categories - number of pages in each Wikipedia category
XXX.stat.inlinks - number of incoming links for each article
To give you an idea of what the output files look like, here you can download
a BZIP2 archive with the output files (except for the log file) produced for the November 11, 2005, snapshot of the English Wikipedia.
The original snapshot itself is available here. Of course, you should strive to use the
latest Wikipedia snapshot; unfortunately, I do not have enough storage (or bandwidth) to provide you with preprocessed versions of the latest
dumps. The script has been tested on the snapshot dated July 19, 2007, and produced about 9 GB of output files (not counting the log file of over 40 GB).
- Resolve all redirection links, that is, if B redirects to C and A links to B, then A should link
directly to C. After you resolve all redirections this way, you can build the entire link graph.
Another reason to resolve redirects is that you can then use anchor text as an additional source of knowledge
(see the first sketch after this list).
Further info on links & redirects:
- When handling links, note that Wikipedia has a convention that only the first letter of the article
title is case-insensitive (it is capitalized automatically), while the remaining letters are case-sensitive, so you need
to normalize link targets to be able to match them against article titles (see the normalization sketch after this list).
A similar story holds for dates, as the following variants both lead to the same article (via a redirect):
1) [[July 20]]
2) [[20 July]]
- Many Wikipedia pages include templates, and some of these templates further include nested templates.
Templates often contain valuable and relevant information, so it makes a lot of sense to try to
resolve them, at least to some nesting depth (I resolved them to depth 5; see the template-expansion sketch after this list).
The way I did it is to pre-scan the entire set of Wikipedia articles and parse all templates, then scan the articles
again and include the templates as needed. To this end, you'll need to understand how template
parameters are inserted. Here are a few relevant URLs:
Specifically, learn about the <noinclude> and <includeonly> tags.
- You'll probably want to remove stubs & other short articles (see the last sketch after this list):
- Wikipedia has an elaborate system of namespaces to which articles belong. I'd recommend that you
learn about the namespaces here:
- More on Wikipedia data dumps:
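
To make these tips more concrete, here are a few small Perl sketches. They are illustrations only, not code taken
from the preprocessor itself, and all names and data in them (hash names, toy titles, thresholds) are made up for the
examples. First, resolving redirect chains and rewriting a toy link table:

    # Minimal sketch: resolve redirect chains so that every link points to a real article.
    # %redirect would normally be filled while scanning the dump; the entries below are toy data.
    use strict;
    use warnings;

    my %redirect = (
        'UK'            => 'United Kingdom',
        'Great Britain' => 'United Kingdom',
    );

    # Follow a redirect chain to its final target, guarding against cycles.
    sub resolve_title {
        my ($title) = @_;
        my %seen;
        while (exists $redirect{$title} && !$seen{$title}++) {
            $title = $redirect{$title};
        }
        return $title;
    }

    # If A links to B and B redirects to C, store the link as A -> C.
    my %links = ('Article A' => ['UK', 'France']);
    for my $src (keys %links) {
        @{ $links{$src} } = map { resolve_title($_) } @{ $links{$src} };
    }
    print join(', ', @{ $links{'Article A'} }), "\n";  # United Kingdom, France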
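
Next, a sketch of link-target normalization under the convention described above (first letter case-insensitive,
underscores equivalent to spaces); the helper name normalize_title is hypothetical:

    # Capitalize the first letter, treat underscores as spaces, and collapse whitespace.
    use strict;
    use warnings;

    sub normalize_title {
        my ($title) = @_;
        $title =~ tr/_/ /;          # underscores and spaces are interchangeable
        $title =~ s/\s+/ /g;        # collapse runs of whitespace
        $title =~ s/^ //;           # trim leading space
        $title =~ s/ $//;           # trim trailing space
        return ucfirst($title);     # only the first letter is case-insensitive
    }

    print normalize_title('july 20'), "\n";          # July 20
    print normalize_title('united_Kingdom'), "\n";   # United Kingdom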
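
A sketch of depth-limited template expansion. The %template hash stands for the template bodies collected in the
first pass; the 'Birth year' template and the purely positional parameters ({{{1}}}, {{{2}}}, ...) are simplifying
assumptions, since real templates also use named parameters, default values, and parser functions:

    use strict;
    use warnings;

    my %template = (
        'Birth year' => 'born in {{{1}}}',   # toy template body, for illustration only
    );

    # Expand a single template call: substitute positional parameters into the body.
    sub expand_one {
        my ($name, $args) = @_;
        my @params = defined $args ? split(/\|/, $args) : ();
        my $body = $template{$name};
        return '' unless defined $body;      # drop calls to unknown templates
        $body =~ s!\{\{\{(\d+)\}\}\}!defined $params[$1 - 1] ? $params[$1 - 1] : ''!ge;
        return $body;
    }

    # Repeatedly expand the innermost (non-nested) template calls, up to a fixed depth.
    sub expand_templates {
        my ($text, $depth) = @_;
        for (1 .. $depth) {
            my $changed = $text =~ s!\{\{([^{}|]+)(?:\|([^{}]*))?\}\}!expand_one($1, $2)!ge;
            last unless $changed;
        }
        return $text;
    }

    print expand_templates('Einstein ({{Birth year|1879}})', 5), "\n";
    # prints: Einstein (born in 1879)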
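
Finally, one possible way to filter out stubs and other short articles; the 100-word threshold and the {{...-stub}}
template test are arbitrary illustrative choices, not the preprocessor's actual criteria:

    use strict;
    use warnings;

    sub is_stub {
        my ($text) = @_;
        my $words = () = $text =~ /\w+/g;                  # crude word count
        return 1 if $words < 100;                          # too short to be useful
        return 1 if $text =~ /\{\{[^{}]*stub[^{}]*\}\}/i;  # e.g. {{geo-stub}}
        return 0;
    }

    print is_stub('{{geo-stub}} A tiny village in Wales.') ? "skip\n" : "keep\n";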
Well, it's free, so don't expect too much support :) I will likely be able to answer simple questions,
but not complex programming questions (please refer to your local Perl guru). I do not promise to fix bugs,
but I will do my best, especially if you suggest a specific way to fix the bug you encountered
(in which case your contribution will, of course, be acknowledged).