The Technion Repository of Text Categorization Datasets provides a large number of diverse test collections for use in text categorization research.
While numerous works have studied text categorization (TC) in the past, good test collections are far less abundant. This scarcity is mainly due to the huge manual effort required to collect a sufficiently large body of text, categorize it, and produce it in machine-readable format. Most studies use the Reuters-21578 collection as the primary benchmark. Others use 20 Newsgroups and OHSUMED, while TREC filtering experiments often use data from the TIPSTER corpus (see below for links to these and other test collections).
In the past, developing a new dataset for text categorization required extensive manual effort to actually label the documents. However, given today's proliferation of the Web, it seems reasonable to acquire large-scale real-life datasets from the Internet, subject to a set of constraints. Observe that Web directories that catalog Internet sites represent readily available results of enormous labeling projects. We therefore propose to capitalize on this body of information in order to derive new datasets in a fully automatic manner. This way, the directory serves as a source of URLs, while its hierarchical organization is used to label the documents collected from these URLs with the corresponding directory categories. Since many Web directories continue to grow through ongoing development, we can expect the raw material for dataset generation to become even more abundant as time passes.
In (Davidov et al., 2004) we proposed a methodology for automatic acquisition of up-to-date datasets with desired properties. The automatic aspect of acquisition facilitates creation of numerous test collections, effectively eliminating a considerable amount of human labor normally associated with preparing a dataset. At the same time, datasets that possess predefined characteristics allow researchers to exercise better control over TC experiments and to collect data geared towards their specific experimentation needs. Choosing these properties in different ways allows one to create focused datasets for improving TC performance in certain areas or under certain constraints, as well as to collect comprehensive datasets for exhaustive evaluation of TC systems.
After the data has been collected, the hierarchical structure of the directory may be used by classification algorithms as background world knowledge---the association between the data and the corresponding portion of the hierarchy is defined by virtue of dataset construction. The resulting datasets can be used for regular text categorization, hypertext categorization, as well as hierarchical text classification. Note also that many Web directories cross-link related categories using so-called "symbolic links", and using such links it is possible to construct datasets suitable for multi-labeled TC experiments.
We developed a software system named Accio that lets the user specify desired dataset parameters, and then efficiently locates suitable categories and collects documents associated with them. It should be observed that Web documents are far less fluent and clean than articles published in the "brick and mortar" world. To ensure the coherence of the data, Accio represents each Web site with several pages gathered from it through crawling, and filters the gathered pages both during and after the crawl. The final processing step computes a number of performance metrics for the generated dataset.
Using the proposed methodology, we have generated a large number of datasets based on the Open Directory Project, although the techniques we propose are readily applicable to other Web directories such as Yahoo!, as well as to non-Web hierarchies of documents. These datasets are organized in several test collections, which are made available through the current repository. This repository is constantly growing, and its growth rate is only limited by bandwidth and storage resources. We believe that having a wide variety of datasets in a centralized repository will allow researchers to perform a wide range of repeatable experiments. The Accio system that performs parameterized dataset acquisition from the Open Directory will be released at a later stage.
At this time, all the datasets contain two categories and are single-labeled, that is, every document belongs to exactly one category (we plan to relax this condition to facilitate multi-labeled datasets in our future work).
Each dataset consists of a pair of ODP categories with an average of 150-200 documents (depending on the specific test collection), and defines a binary classification task whose goal is to tell these two categories apart. When generating datasets from Web directories, where each category contains links to actual Internet sites, we construct text documents representative of those sites. Following the scheme introduced by Yang et al. (2002), each link cataloged in the ODP is used to obtain a small representative sample of the target Web site. To this end, we crawl the target site in BFS order, starting from the URL listed in the directory. A predefined number of Web pages are downloaded and concatenated into a synthetic document, which is then filtered to remove noise and HTML markup. We refer to these individual pages as sub-documents, since their concatenation yields one document for the categorization task. We usually refer to the synthetic documents created by pooling sub-documents simply as documents, to be consistent with text categorization terminology; alternatively, we call them meta-documents when necessary to avoid ambiguity. In this project we concatenated up to the first 5 pages crawled in BFS order from each site. The average document size after filtering is slightly over 11 kilobytes.
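To make the sampling step concrete, here is a minimal Python sketch of a per-site BFS crawl that collects up to 5 pages and never leaves the starting host (anticipating the online filtering described below). The function names and the use of the standard library are choices made for this illustration only; this is not the actual Accio implementation.

from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    # Collects the href targets of <a> tags encountered in an HTML page.
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def sample_site(start_url, max_pages=5):
    # BFS crawl starting from the URL listed in the directory, restricted
    # to the starting host, returning the raw HTML of up to max_pages pages.
    start_host = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}
    pages = []
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except OSError:
            continue  # unreachable pages are simply skipped
        pages.append(html)
        extractor = LinkExtractor()
        extractor.feed(html)
        for href in extractor.links:
            target = urljoin(url, href).split("#")[0]  # drop fragments
            if urlparse(target).netloc == start_host and target not in seen:
                seen.add(target)
                queue.append(target)
    return pages

The meta-document for a site is then the concatenation of the sampled pages, each page becoming one sub-document after HTML-to-text filtering.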
Finally, HTML documents are converted into plain text and organized as a dataset, which we render in a simple XML-like format. It should be noted that converting HTML to text is not always perfect, since some small auxiliary text snippets (as found in menus and the like) may survive this procedure; we view such remnants as low-level residual noise inherent in automated data acquisition.
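As an illustration of what such a conversion involves, the following sketch extracts visible text from an HTML page using Python's standard html.parser module. The actual conversion tool used to produce the datasets is not specified here, so this should be read only as an approximation of the step described above.

from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    # Keeps the text content of a page, dropping tags as well as the
    # contents of <script> and <style> elements.
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)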
Data collected from the Web can be quite noisy. Common examples of this noise are textual advertisements, numerous unrelated images, and text rendered in background color aimed at duping search engines. To reduce the amount of noise in generated datasets we employ filtering mechanisms before, during, and after downloading the data.
Pre-processing filtering eliminates certain categories from consideration. For example, we unconditionally disregard the entire Top/World subtree of the Open Directory that catalogs Web sites in languages other than English. Similarly, the Top/Adult subtree may be pruned to eliminate inappropriate adult content.
Recall that for every directory link we download a number of pages whose concatenation represents the corresponding Web site. Consequently, online filtering performed during the download restricts the crawler to the site linked from the directory, and does not allow it to pursue external links to other sites.
Post-processing filtering analyzes all the downloaded documents as a group, and selects the ones to be concatenated into the final meta-document. Two types of post-processing filtering are employed:
In plain text form, each dataset consists of a pair of files corresponding to the two categories comprising the dataset. Each file contains all the documents in one category in ASCII text format, which resulted from HTML-to-text conversion.
In our work (see section "References" below) we applied the following preprocessing steps to this representation:
If you are more interested in core machine learning and would rather not deal with preprocessing raw text, we also provide the datasets in the form of already preprocessed feature vectors.
In this format, texts were only tokenized and digitized, but underwent no other preprocessing whatsoever. Specifically:
Each dataset contains a pair of categories, which we uniformly call "positive" and "negative". Consequently, each dataset is comprised of two ASCII text files, each containing the documents labeled with one category; these files are named "all_pos.txt" and "all_neg.txt".
Each of these files has the following structure:
<dmoz_doc>
id=xxx
<dmoz_subdoc>
...
</dmoz_subdoc>
<dmoz_subdoc>
...
</dmoz_subdoc>
<dmoz_subdoc>
...
</dmoz_subdoc>
<dmoz_subdoc>
...
</dmoz_subdoc>
<dmoz_subdoc>
...
</dmoz_subdoc>
</dmoz_doc>
...
<dmoz_doc>
id=xxx
...
</dmoz_doc>
Each document is enclosed in a pair of tags <dmoz_doc> ... </dmoz_doc>. Document ids are specified as id=xxx, where xxx are unique integer numbers (not necessarily consecutive). As explained in section "Data acquisition procedure" above, each document was constructed by concatenating up to 5 Web pages crawled starting from an ODP link. This structure is reflected in the <dmoz_subdoc> ... </dmoz_subdoc> tags, which enclose these individual pages (called subdocuments). A document may have fewer than 5 subdocuments if the corresponding Web site didn't have that many pages at crawling time. Occasionally, some subdocuments may be empty, which corresponds to the case where the original Web page had no text left after HTML-to-text filtering.
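A minimal line-oriented reader for this format might look as follows. This is an illustration only: the function name is ours, and the code assumes that the tags and the id=xxx field appear on lines of their own, exactly as in the structure shown above.

from collections import namedtuple


def read_dmoz_file(path):
    # Returns a list of (doc_id, [subdocument texts]) pairs read from one
    # of the per-category files (e.g. all_pos.txt or all_neg.txt).
    docs = []
    doc_id, subdocs, current = None, [], None
    with open(path, encoding="ascii", errors="replace") as f:
        for raw in f:
            line = raw.rstrip()  # drop trailing newline/CR
            if line == "<dmoz_doc>":
                doc_id, subdocs = None, []
            elif line == "</dmoz_doc>":
                docs.append((doc_id, subdocs))
            elif line == "<dmoz_subdoc>":
                current = []
            elif line == "</dmoz_subdoc>":
                subdocs.append("\n".join(current))
                current = None
            elif current is not None:
                current.append(line)  # body line of the current subdocument
            elif line.startswith("id="):
                doc_id = int(line[3:])
    return docs

The text of a document is then obtained by concatenating its subdocuments, for example "\n".join(subdocs) for each (doc_id, subdocs) pair returned above.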
Feature vectors are provided in the SVMlight format; each vector has the following form:
<vector> .=. <class> <feature>:<value> <feature>:<value> ... <feature>:<value>
<class> .=. +1 | -1
<feature> .=. <integer>
<value> .=. <float>
The class value and the feature/value pairs are separated by spaces. Feature/value pairs are listed in increasing order of feature ids. Features with zero values are omitted.
The class value denotes the class of the example: +1 marks a positive example and -1 marks a negative example. For example, the line

-1 1:2 3:4 9284:3

specifies a negative example for which feature number 1 has the value 2, feature number 3 has the value 4, feature number 9284 has the value 3, and all the other features have the value 0.
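This format is straightforward to parse; the short Python sketch below (with names chosen for this illustration) reads one such line into a label and a sparse feature dictionary.

def parse_svmlight_line(line):
    # Parse one SVMlight-formatted line into (label, features), where
    # features maps feature ids to values; absent features are implicitly 0.
    parts = line.split()
    label = int(parts[0])  # +1 or -1
    features = {}
    for pair in parts[1:]:
        feature_id, value = pair.split(":")
        features[int(feature_id)] = float(value)
    return label, features

# The example line from above:
label, features = parse_svmlight_line("-1 1:2 3:4 9284:3")
assert label == -1 and features == {1: 2.0, 3: 4.0, 9284: 3.0}

Existing tools can read the format as well: SVMlight consumes it directly, and libraries such as scikit-learn provide a loader (load_svmlight_file) that returns a sparse feature matrix together with the label vector.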
Dmitry Davidov, Evgeniy Gabrilovich, and Shaul Markovitch
"Parameterized Generation of Labeled Datasets for Text Categorization Based on a Hierarchical Directory"
The 27th Annual International ACM SIGIR Conference, pp. 250-257, Sheffield, UK, July 2004
Evgeniy Gabrilovich and Shaul Markovitch
"Text Categorization with Many Redundant Features: Using Aggressive Feature Selection to Make SVMs Competitive with C4.5"
The 21st International Conference on Machine Learning (ICML), pp. 321-328, Banff, Alberta, Canada, July 2004
Please also inform your readers of the current location of the data: http://techtc.cs.technion.ac.il
Evgeniy Gabrilovich
gabr@cs.technion.ac.il
Last updated on August 24, 2011