TechTC - Technion Repository of Text Categorization Datasets

Maintained by: Evgeniy Gabrilovich (gabr@cs.technion.ac.il)
  1. Overview
  2. Description
  3. Availability and usage
  4. Mailing list
  5. Questions?
  6. References
  7. Additional publications
  8. Other test collections for text categorization

Overview

The Technion Repository of Text Categorization Datasets provides a large number of diverse test collections for use in text categorization research.

Background

While numerous works have studied text categorization (TC) in the past, good test collections are far less abundant. This scarcity is mainly due to the huge manual effort required to collect a sufficiently large body of text, categorize it, and produce it in machine-readable format. Most studies use the Reuters-21578 collection as the primary benchmark. Others use 20 Newsgroups and OHSUMED, while TREC filtering experiments often use data from the TIPSTER corpus (see below for links to these and other test collections).

In the past, developing a new dataset for text categorization required extensive manual effort to actually label the documents. However, given today's proliferation of the Web, it seems reasonable to acquire large-scale real-life datasets from the Internet, subject to a set of constraints. Observe that Web directories that catalog Internet sites represent readily available results of enormous labeling projects. We therefore propose to capitalize on this body of information in order to derive new datasets in a fully automatic manner. This way, the directory serves as a source of URLs, while its hierarchical organization is used to label the documents collected from these URLs with the corresponding directory categories. Since many Web directories continue to grow, we can expect the raw material for dataset generation to become even more abundant as time passes.

In (Davidov et al., 2004) we proposed a methodology for automatic acquisition of up-to-date datasets with desired properties. The automatic aspect of acquisition facilitates creation of numerous test collections, effectively eliminating a considerable amount of human labor normally associated with preparing a dataset. At the same time, datasets that possess predefined characteristics allow researchers to exercise better control over TC experiments and to collect data geared towards their specific experimentation needs. Choosing these properties in different ways allows one to create focused datasets for improving TC performance in certain areas or under certain constraints, as well as to collect comprehensive datasets for exhaustive evaluation of TC systems.

After the data has been collected, the hierarchical structure of the directory may be used by classification algorithms as background world knowledge---the association between the data and the corresponding portion of the hierarchy is defined by virtue of dataset construction. The resulting datasets can be used for regular text categorization, hypertext categorization, and hierarchical text classification. Note also that many Web directories cross-link related categories using so-called "symbolic links"; using such links, it is possible to construct datasets suitable for multi-labeled TC experiments.

We developed a software system named Accio that lets the user specify desired dataset parameters, and then efficiently locates suitable categories and collects documents associated with them. It should be observed that Web documents are far less fluent and clean than articles published in the "brick and mortar" world. To ensure the coherence of the data, Accio represents each Web site with several pages gathered from it through crawling, and filters the gathered pages both during and after the crawl. The final processing step computes a number of performance metrics for the generated dataset.

Using the proposed methodology, we have generated a large number of datasets based on the Open Directory Project, although the techniques we propose are readily applicable to other Web directories such as Yahoo!, as well as to non-Web hierarchies of documents. These datasets are organized in several test collections, which are made available through the current repository. This repository is constantly growing, and its growth rate is only limited by bandwidth and storage resources. We believe that having a wide variety of datasets in a centralized repository will allow researchers to perform a wide range of repeatable experiments. The Accio system that performs parameterized dataset acquisition from the Open Directory will be released at a later stage.

Description

At this time, all the datasets contain two categories and are single-labeled, that is, every document belongs to exactly one category (we plan to relax this condition to facilitate multi-labeled datasets in our future work).

Data acquisition procedure

Each dataset consists of a pair of ODP categories with an average of 150-200 documents (depending on the specific test collection), and defines a binary classification task that consists of telling these two categories apart. When generating datasets from Web directories, where each category contains links to actual Internet sites, we construct text documents representative of those sites. Following the scheme introduced by Yang et al. (2002), each link cataloged in the ODP is used to obtain a small representative sample of the target Web site. To this end, we crawl the target site in breadth-first (BFS) order, starting from the URL listed in the directory. A predefined number of Web pages are downloaded and concatenated into a synthetic document, which is then filtered to remove noise and HTML markup. We refer to these individual pages as sub-documents, since their concatenation yields one document for the categorization task. To be consistent with text categorization terminology, we usually refer to synthetic documents created by pooling sub-documents simply as documents; alternatively, we call them meta-documents when necessary to avoid ambiguity. In this project we concatenated up to the first 5 pages crawled in BFS order from each site. The average document size after filtering is slightly over 11 kilobytes.

Finally, HTML documents are converted into plain text and organized as a dataset, which we render in a simple XML-like format. It should be noted that converting HTML to text is not always perfect, since some small auxiliary text snippets (as found in menus and the like) may survive this procedure; we view such remnants as low residual noise inherent in automated data acquisition.
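For illustration only, the following Python sketch shows how such an acquisition step might be implemented. It is not the Accio system: the library choices (requests, BeautifulSoup) and helper names are assumptions, while the page limit of 5 and the restriction to the site linked from the directory follow the description above and the filtering rules described in the next section.

# Minimal sketch of the acquisition step described above (not the Accio system).
# Assumes the third-party libraries "requests" and "beautifulsoup4" are installed.
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

MAX_PAGES = 5  # up to 5 pages are concatenated per site (see above)

def crawl_site(root_url, max_pages=MAX_PAGES):
    """Crawl a site in BFS order starting from the ODP-listed URL and
    return the plain-text sub-documents of up to max_pages pages."""
    site = urlparse(root_url).netloc
    queue = deque([root_url])
    seen = {root_url}
    subdocs = []
    while queue and len(subdocs) < max_pages:
        url = queue.popleft()
        try:
            resp = requests.get(url, timeout=10)
            resp.raise_for_status()
        except requests.RequestException:
            continue  # unreachable or error pages are simply skipped here
        soup = BeautifulSoup(resp.text, "html.parser")
        # HTML-to-text conversion: keep only the visible text
        subdocs.append(soup.get_text(separator=" ", strip=True))
        for a in soup.find_all("a", href=True):
            link = urljoin(url, a["href"])
            # online filtering: stay within the site linked from the directory
            if urlparse(link).netloc == site and link not in seen:
                seen.add(link)
                queue.append(link)
    return subdocs

def make_document(root_url):
    """Concatenate the sub-documents into one meta-document."""
    return "\n".join(crawl_site(root_url))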

Filtering the raw data to cope with noise

Data collected from the Web can be quite noisy. Common examples of this noise are textual advertisements, numerous unrelated images, and text rendered in background color aimed at duping search engines. To reduce the amount of noise in generated datasets we employ filtering mechanisms before, during, and after downloading the data.

Pre-processing filtering eliminates certain categories from consideration. For example, we unconditionally disregard the entire Top/World subtree of the Open Directory that catalogs Web sites in languages other than English. Similarly, the Top/Adult subtree may be pruned to eliminate inappropriate adult content.
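A minimal sketch of this pre-processing rule follows; the helper name is hypothetical, and the excluded subtrees are the ones mentioned above.

# Hypothetical helper illustrating pre-processing filtering of ODP categories.
EXCLUDED_SUBTREES = ("Top/World", "Top/Adult")  # non-English sites and adult content

def keep_category(odp_path: str) -> bool:
    """Return True if the ODP category should be considered for dataset generation."""
    return not odp_path.startswith(EXCLUDED_SUBTREES)

# keep_category("Top/World/Deutsch")  ->  False
# keep_category("Top/Arts/Music")     ->  True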

Recall that for every directory link we download a number of pages whose concatenation represents the corresponding Web site. Consequently, online filtering performed during the download restricts the crawler to the site linked from the directory, and does not allow it to pursue external links to other sites.

Post-processing filtering analyzes all the downloaded documents as a group, and selects the ones to be concatenated into the final meta-document. Two types of post-processing filtering are employed:

  1. Weak filtering discards Web pages that contain HTTP error messages, or contain fewer than a predefined number of words.
  2. Strong filtering attempts to eliminate unrelated pages that do not adequately represent the site they were collected from (e.g., legal notices or discussion forum rules). To eliminate such pages, we try to identify obvious outliers. We use the root page of a Web site (i.e., the page linked from the directory) as a "model" deemed to be representative of the site as a whole. Whenever the root page contains enough text for comparison, we use the text distance metric developed in (Davidov et al., 2004; Section 2.1.3) to compute the distance between it and every other page downloaded from the site. We then discard all pages located "further" from the root than one standard deviation above the average.
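For illustration only, the sketch below mimics both post-processing filters. The word-count threshold and the error-message test are placeholders, and cosine distance over bag-of-words vectors stands in for the text distance metric of (Davidov et al., 2004; Section 2.1.3), which is not reproduced here.

# Sketch of the post-processing filtering described above (thresholds are placeholders).
from collections import Counter
from math import sqrt
from statistics import mean, pstdev

MIN_WORDS = 50  # placeholder for the "predefined number of words" used by weak filtering

def weak_filter(pages):
    """Drop pages that look like HTTP error messages or are too short."""
    return [p for p in pages
            if len(p.split()) >= MIN_WORDS and "404 Not Found" not in p]

def cosine_distance(a, b):
    """Cosine distance between two texts, used here as a stand-in distance metric."""
    va, vb = Counter(a.split()), Counter(b.split())
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = sqrt(sum(c * c for c in va.values())) * sqrt(sum(c * c for c in vb.values()))
    return 1.0 if norm == 0 else 1.0 - dot / norm

def strong_filter(root_page, pages):
    """Discard pages farther from the root page than one standard deviation above the mean."""
    if not pages:
        return []
    dists = [cosine_distance(root_page, p) for p in pages]
    threshold = mean(dists) + pstdev(dists)
    return [p for p, d in zip(pages, dists) if d <= threshold]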

Data encoding format

The data is available in two formats:
  1. Plain text

    In plain text form, each dataset consists of a pair of files corresponding to the two categories comprising the dataset. Each file contains all the documents in one category in ASCII text format, which resulted from HTML-to-text conversion.

    In our work (see the "References" section below) we applied several preprocessing steps to this representation.

  2. Preprocessed feature vectors

    If you are more interested in core machine learning and would rather not deal with preprocessing raw text, we also provide the datasets in the form of already preprocessed feature vectors.

    In this format, the texts were only tokenized and digitized; they underwent no other preprocessing whatsoever.

Plain text format

Each dataset contains a pair of categories, which we uniformly call "positive" and "negative". Consequently, each dataset comprises two ASCII text files, each containing the documents labeled with one category; these files are named "all_pos.txt" and "all_neg.txt".

Each of these files has the following structure:

<dmoz_doc>
id=xxx

<dmoz_subdoc>
...
</dmoz_subdoc>

<dmoz_subdoc>
...
</dmoz_subdoc>

<dmoz_subdoc>
...
</dmoz_subdoc>

<dmoz_subdoc>
...
</dmoz_subdoc>

<dmoz_subdoc>
...
</dmoz_subdoc>

</dmoz_doc>
...
<dmoz_doc>
id=xxx
...
</dmoz_doc>

Each document is enclosed in a pair of tags <dmoz_doc> ... </dmoz_doc>. Document ids are specified as id=xxx, where xxx is a unique integer (ids are not necessarily consecutive). As explained in the section "Data acquisition procedure" above, each document was constructed by concatenating up to 5 Web pages crawled starting from an ODP link. This structure is reflected through the <dmoz_subdoc> ... </dmoz_subdoc> tags, which enclose these individual pages (called subdocuments). A document may have fewer than 5 subdocuments if the corresponding Web site didn't have that many pages at crawling time. Occasionally, some subdocuments may be empty, which corresponds to the case where the original Web page had no text left after HTML-to-text conversion.
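A minimal Python reader for this format is sketched below; the regular expressions simply mirror the tags shown above, and the function name is only illustrative.

# Minimal parser for the plain-text format (all_pos.txt / all_neg.txt) described above.
import re

DOC_RE = re.compile(r"<dmoz_doc>(.*?)</dmoz_doc>", re.DOTALL)
SUBDOC_RE = re.compile(r"<dmoz_subdoc>(.*?)</dmoz_subdoc>", re.DOTALL)
ID_RE = re.compile(r"id=(\d+)")

def parse_techtc_file(path):
    """Yield (document_id, list_of_subdocuments) pairs from one category file."""
    with open(path, encoding="ascii", errors="replace") as f:
        text = f.read()
    for doc in DOC_RE.finditer(text):
        body = doc.group(1)
        doc_id = int(ID_RE.search(body).group(1))
        subdocs = [m.group(1).strip() for m in SUBDOC_RE.finditer(body)]
        yield doc_id, subdocs

# Example: build labeled (id, text, label) triples from the two category files
# data = [(i, " ".join(s), +1) for i, s in parse_techtc_file("all_pos.txt")] \
#      + [(i, " ".join(s), -1) for i, s in parse_techtc_file("all_neg.txt")]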

Preprocessed feature vectors

Each dataset is given as a pair of files:
  1. File "vectors.dat" contains the feature vectors in ASCII text format. The file starts with two comment lines that begin with a "#" sign. Thereafter, the file contains pairs of lines where the first line in each pair is a comment line starting with a "#" sign and containing a document id, and the second line contains the document encoding as a feature vector.

    Feature vectors follow the SVMlight format; each vector has the following form:

    <vector> .=. <class> <feature>:<value> <feature>:<value> ... <feature>:<value>
    <class> .=. +1 | -1
    <feature> .=. <integer>
    <value> .=. <float>

    The class value and each of the feature/value pairs are separated by spaces. Feature/value pairs are listed in increasing order of feature ids. Features with zero values are omitted.

    The class value denotes the class of the example: +1 marks a positive example, and -1 marks a negative example. For example, the line

    -1 1:2 3:4 9284:3
    specifies a negative example for which feature number 1 has the value 2, feature number 3 has the value 4, feature number 9284 has the value 3, and all the other features have the value 0.

  2. File "features.idx" contains a list of features and their ids. The file starts with a number of comment lines that begin with a "#" sign and explain the file format. The rest of the file contains a list of all the features in the dataset, where each line contains an integer feature id and then the feature itself. If you are only interested in feeding the feature vectors to your favourite machine learning algorithm, then this file is obviously not necessary. However, you can use the information provided in this file if you decide to apply some text preprocessing steps such as stemming or stop-word removal.

Availability and usage

The following test collections are currently available:

Conditions of use

If you publish results based on these test collections, please cite the two papers listed in the "References" section below.

Please also inform your readers of the current location of the data: http://techtc.cs.technion.ac.il

Software

Nil Geisweiller has kindly made available his software, which can be used to create datasets based on the ODP: https://github.com/ngeiswei/techtc-builder

Mailing list

To receive periodic updates and to participate in discussions on TechTC, please subscribe to the TechTC mailing list at http://groups.yahoo.com/group/techtc.

Questions?

If you have questions or comments, please post them to the mailing list (see above), or email me directly at gabr@cs.technion.ac.il.

References

  1. Dmitry Davidov, Evgeniy Gabrilovich, and Shaul Markovitch
    "Parameterized Generation of Labeled Datasets for Text Categorization Based on a Hierarchical Directory"
    The 27th Annual International ACM SIGIR Conference, pp. 250-257, Sheffield, UK, July 2004

  2. Evgeniy Gabrilovich and Shaul Markovitch
    "Text Categorization with Many Redundant Features: Using Aggressive Feature Selection to Make SVMs Competitive with C4.5"
    The 21st International Conference on Machine Learning (ICML), pp. 321-328, Banff, Alberta, Canada, July 2004

Additional publications

If you are using either of these test collections and want your article(s) listed here, please email me at gabr@cs.technion.ac.il.
  1. Your paper here ...

Other test collections for text categorization

  1. Reuters-21578
  2. Reuters Corpus Volume 1 (RCV1)
  3. 20 Newsgroups
  4. Movie Reviews
  5. OHSUMED
  6. TREC data at NIST
  7. Topic Detection and Tracking (TDT)

Evgeniy Gabrilovich
gabr@cs.technion.ac.il

Last updated on August 24, 2011