TechTC - Technion Repository of Text Categorization Datasets

The TechTC-100 Test Collection for Text Categorization

Version: 1.0
Release date: April 14, 2004
Maintained by: Evgeniy Gabrilovich (gabr@cs.technion.ac.il)
  1. Overview
  2. Availability and usage
  3. Detailed statistics
  4. Mailing list
  5. Questions?
  6. References
  7. Additional publications

Overview

The TechTC-100 Test Collection contains 100 labeled datasets whose categorization difficulty (as measured by baseline SVM accuracy) is uniformly distributed between 0.6 and 0.92. This test collection was used for the experiments in feature selection for text categorization described in (Gabrilovich and Markovitch, 2004).

The data acquisition procedure and the format of the data files for this collection are comprehensively described at the main TechTC page.
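
To illustrate what "baseline SVM accuracy" means for an individual dataset, here is a minimal sketch that computes the cross-validated accuracy of a linear SVM over the full feature set. It assumes Python with scikit-learn, which is not the software used in the original experiments; the function name baseline_svm_accuracy and the lists pos_docs / neg_docs are hypothetical, and loading the documents from the dataset files (in the format described on the main TechTC page) is omitted.

    # Minimal sketch (assumed tooling: Python + scikit-learn) of measuring the
    # "baseline SVM accuracy" of one TechTC-100 dataset over the full feature set.
    # pos_docs / neg_docs are hypothetical lists of raw document strings, one list
    # per ODP category; reading them from the dataset files is omitted here.
    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    def baseline_svm_accuracy(pos_docs, neg_docs, folds=4):
        """Mean cross-validated accuracy of a linear SVM using all features."""
        texts = pos_docs + neg_docs
        labels = np.array([1] * len(pos_docs) + [0] * len(neg_docs))
        # Bag-of-words features over the entire vocabulary (no feature selection).
        pipeline = make_pipeline(CountVectorizer(), LinearSVC())
        scores = cross_val_score(pipeline, texts, labels, cv=folds, scoring="accuracy")
        return scores.mean()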

Availability and usage

Download the entire test collection as a single ZIP file.

Section "Detailed statistics" below contains further information on individual datasets.

Conditions of use

If you publish results based on this test collection, please cite the following paper:
Evgeniy Gabrilovich and Shaul Markovitch
"Text Categorization with Many Redundant Features: Using Aggressive Feature Selection to Make SVMs Competitive with C4.5"
The 21st International Conference on Machine Learning (ICML), pp. 321-328, Banff, Alberta, Canada, July 2004

Please also inform your readers of the current location of the data: http://techtc.cs.technion.ac.il/techtc100/techtc100.html

Software

Nil Geisweiller has kindly made available his software for creating datasets based on the ODP: https://github.com/ngeiswei/techtc-builder

Detailed statistics

The following table provides detailed information about the individual datasets in the TechTC-100 collection. Each row describes one dataset. The first column gives the dataset's ordinal number, followed by two columns with the ids of the ODP categories that make up the dataset (linked to the corresponding categories in the ODP hierarchy). The fourth column gives the number of documents in the dataset. The next three columns give the accuracy of text categorization with SVM, C4.5 and KNN (respectively) using all features, i.e., without any feature selection. The last three columns give the accuracy at the optimal feature selection level for each classifier (0.5% of the features for SVM and C4.5, 2% for KNN); a sketch of such a feature-selection run appears after the table.

No.  Category id1  Category id2  Number of documents  SVM (100%)  C4.5 (100%)  KNN (100%)  SVM (0.5%)  C4.5 (0.5%)  KNN (2%)
1 1622 42350 163 0.8625 0.7725 0.838 0.74375 0.750 0.729
2 6920 8366 140 0.919 0.897 0.927 0.897 0.897 0.907
3 8308 8366 144 0.8285 0.743 0.793 0.8285 0.807 0.862
4 10341 10755 145 0.77075 0.811 0.708 0.8195 0.792 0.794
5 10341 14271 145 0.6805 0.78775 0.618 0.81275 0.820 0.796
6 10341 14525 158 0.58975 0.5065 0.590 0.564 0.615 0.514
7 10341 61792 153 0.7565 0.80275 0.717 0.79575 0.822 0.810
8 10341 186330 147 0.70125 0.8 0.729 0.8475 0.854 0.826
9 10341 194927 159 0.73725 0.7695 0.737 0.80125 0.769 0.821
10 10350 10539 157 0.69875 0.75 0.673 0.7885 0.808 0.760
11 10350 13928 148 0.75 0.66575 0.681 0.83325 0.792 0.815
12 10350 194915 154 0.7305 0.78025 0.678 0.85525 0.849 0.875
13 10385 14525 156 0.829 0.65725 0.798 0.8945 0.836 0.852
14 10385 25326 153 0.87175 0.8515 0.831 0.87175 0.885 0.865
15 10385 269078 153 0.80425 0.892 0.734 0.946 0.919 0.848
16 10385 299104 147 0.799 0.54525 0.780 0.736 0.722 0.753
17 10385 312035 145 0.65 0.53625 0.655 0.48575 0.519 0.543
18 10539 10567 152 0.65775 0.855 0.638 0.8685 0.921 0.723
19 10539 11346 155 0.84225 0.875 0.763 0.9475 0.921 0.827
20 10539 20673 152 0.67775 0.81575 0.664 0.86825 0.895 0.855
21 10539 61792 161 0.65625 0.78125 0.569 0.80625 0.775 0.773
22 10539 85489 155 0.80925 0.671 0.789 0.783 0.790 0.770
23 10539 186330 155 0.65775 0.645 0.651 0.763 0.763 0.792
24 10539 194915 165 0.7375 0.8535 0.658 0.89625 0.897 0.803
25 10539 300332 164 0.76825 0.85225 0.725 0.9085 0.896 0.860
26 10567 11346 139 0.794 0.9045 0.713 0.9635 0.964 0.941
27 10567 12121 138 0.7795 0.934 0.747 0.9635 0.956 0.912
28 10567 46076 142 0.65 0.9 0.593 0.94275 0.936 0.879
29 11346 17360 140 0.8235 0.93375 0.765 0.956 0.934 0.919
30 11346 22294 125 0.85025 0.8925 0.825 0.93325 0.908 0.933
31 11498 14517 125 0.9165 0.89175 0.892 0.95 0.909 0.917
32 13928 18479 151 0.75025 0.8245 0.737 0.811 0.798 0.845
33 13928 71892 154 0.892 0.8245 0.838 0.9055 0.892 0.837
34 13928 186330 146 0.71425 0.743 0.736 0.80725 0.786 0.825
35 13928 300332 155 0.83525 0.7735 0.809 0.90775 0.855 0.888
36 13928 312035 146 0.7215 0.74275 0.679 0.7645 0.772 0.766
37 14271 20186 145 0.67375 0.87675 0.521 0.90275 0.951 0.807
38 14271 46076 143 0.6785 0.735 0.693 0.78575 0.786 0.806
39 14271 194927 152 0.69625 0.79075 0.722 0.84475 0.831 0.847
40 14271 312035 140 0.7575 0.79425 0.735 0.79425 0.787 0.835
41 14517 20673 127 0.89525 0.89475 0.790 0.93525 0.944 0.903
42 14517 186330 130 0.86275 0.8505 0.825 0.911 0.936 0.836
43 14525 61792 159 0.73725 0.8015 0.731 0.827 0.808 0.750
44 14525 194927 165 0.66875 0.75 0.688 0.80625 0.819 0.870
45 14630 18479 157 0.8075 0.86025 0.795 0.90375 0.859 0.891
46 14630 20186 157 0.59625 0.94875 0.468 0.92925 0.949 0.833
47 14630 94142 154 0.77625 0.7985 0.829 0.90125 0.887 0.882
48 14630 300332 161 0.90625 0.88125 0.844 0.9 0.888 0.872
49 14630 312035 152 0.72325 0.8515 0.717 0.89875 0.885 0.773
50 14630 814096 163 0.8625 0.875 0.838 0.9375 0.925 0.906
51 17360 20186 145 0.597 0.875 0.563 0.90275 0.931 0.852
52 17360 46875 150 0.6895 0.73675 0.656 0.88525 0.858 0.872
53 18479 20186 152 0.64475 0.93425 0.533 0.921 0.928 0.875
54 18479 20673 144 0.75 0.87475 0.660 0.8475 0.847 0.845
55 18479 46076 150 0.6625 0.757 0.710 0.8245 0.791 0.799
56 18479 186330 147 0.729 0.72475 0.709 0.80575 0.785 0.766
57 20186 22294 130 0.78125 0.89075 0.704 0.93 0.907 0.880
58 20186 61792 153 0.592 0.89175 0.582 0.84225 0.928 0.731
59 20673 46076 142 0.70725 0.793 0.729 0.88575 0.907 0.872
60 20673 269078 147 0.71525 0.8815 0.625 0.9375 0.938 0.903
61 20673 312035 139 0.6395 0.78675 0.618 0.8385 0.838 0.812
62 22294 25575 127 0.9275 0.88275 0.911 0.95975 0.936 0.870
63 22294 46076 128 0.8065 0.92075 0.742 0.9435 0.952 0.883
64 25575 47456 143 0.88575 0.8785 0.879 0.9 0.890 0.876
65 25575 275169 151 0.8785 0.88525 0.811 0.91225 0.892 0.882
66 25936 94142 144 0.86425 0.7715 0.843 0.8645 0.836 0.820
67 46076 61792 151 0.676 0.80425 0.642 0.7975 0.798 0.807
68 46875 61792 158 0.72425 0.79475 0.680 0.88475 0.859 0.802
69 47418 814096 155 0.77625 0.86825 0.698 0.90125 0.911 0.881
70 47456 497201 131 0.8515 0.82025 0.836 0.9065 0.888 0.906
71 58108 85489 147 0.8405 0.764 0.854 0.8405 0.778 0.809
72 61792 814096 159 0.91025 0.87175 0.853 0.94875 0.923 0.873
73 69753 85489 156 0.875 0.86175 0.908 0.9275 0.908 0.890
74 85489 90753 154 0.865 0.80425 0.903 0.838 0.831 0.845
75 186330 46076 145 0.65 0.7155 0.672 0.8215 0.764 0.832
76 186330 94142 144 0.89975 0.74275 0.886 0.84275 0.814 0.850
77 186330 195558 139 0.93375 0.91175 0.956 0.93375 0.883 0.902
78 186330 300332 151 0.7975 0.798 0.804 0.8515 0.798 0.858
79 186330 314499 146 0.88575 0.74525 0.893 0.87125 0.829 0.857
80 194915 20186 157 0.40375 0.561 0.547 0.3975 0.603 0.442
81 194915 67777 153 0.90775 0.9145 0.888 0.97375 0.954 0.934
82 194915 194927 164 0.63125 0.70625 0.613 0.7625 0.706 0.639
83 194915 324745 152 0.68275 0.84475 0.548 0.84475 0.852 0.875
84 194927 20186 159 0.67325 0.8845 0.673 0.83325 0.904 0.759
85 194927 46875 164 0.725 0.78125 0.656 0.8375 0.825 0.833
86 194927 61792 160 0.7245 0.7435 0.744 0.73075 0.782 0.696
87 194927 299104 156 0.83525 0.4815 0.717 0.7895 0.762 0.778
88 194927 312035 154 0.79075 0.7975 0.798 0.80425 0.798 0.800
89 269078 46076 153 0.8245 0.9055 0.771 0.95275 0.939 0.871
90 269078 324745 150 0.8055 0.872 0.715 0.90975 0.896 0.903
91 299104 46076 147 0.729 0.61825 0.743 0.785 0.750 0.796
92 299104 58108 149 0.91225 0.84475 0.865 0.865 0.825 0.888
93 299104 312035 144 0.69275 0.53875 0.650 0.7215 0.614 0.764
94 300332 85489 151 0.8245 0.57125 0.865 0.7975 0.628 0.783
95 316970 85489 145 0.85025 0.79275 0.822 0.83575 0.843 0.839
96 324745 61792 148 0.7015 0.85425 0.646 0.90975 0.910 0.784
97 324745 85489 142 0.8235 0.875 0.846 0.89725 0.912 0.831
98 332386 61792 159 0.846 0.7235 0.842 0.782 0.737 0.863
99 332386 85489 153 0.838 0.76275 0.854 0.7975 0.757 0.811
100 364836 71892 142 0.8825 0.87925 0.875 0.912 0.902 0.912
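
The 0.5% and 2% figures in the last three column headers mean that only that fraction of the features is kept before training. The following minimal sketch shows such aggressive feature selection, again assuming Python with scikit-learn; the chi-square ranking used here is an illustrative stand-in, whereas the ICML 2004 paper ranks features by information gain, and the name selected_svm_accuracy is hypothetical.

    # Minimal sketch (assumed tooling: Python + scikit-learn) of "aggressive"
    # feature selection: keep only a small top fraction of the bag-of-words
    # features before training the SVM. Chi-square ranking is an illustrative
    # stand-in for the information-gain ranking used in the original experiments.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_selection import SelectPercentile, chi2
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    def selected_svm_accuracy(texts, labels, percentile=0.5, folds=4):
        """Mean cross-validated SVM accuracy using only the top `percentile`% of features."""
        pipeline = make_pipeline(
            CountVectorizer(),                              # full bag-of-words vocabulary
            SelectPercentile(chi2, percentile=percentile),  # keep the top percentile% (0.5% by default)
            LinearSVC(),
        )
        return cross_val_score(pipeline, texts, labels, cv=folds, scoring="accuracy").mean()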


Mailing list

To receive periodic updates and to participate in discussions on TechTC, please subscribe to the TechTC mailing list at http://groups.yahoo.com/group/techtc.

Questions?

If you have questions or comments, please post them to the mailing list (see above), or email me directly at gabr@cs.technion.ac.il.

References

  1. Dmitry Davidov, Evgeniy Gabrilovich, and Shaul Markovitch
    "Parameterized Generation of Labeled Datasets for Text Categorization Based on a Hierarchical Directory"
    The 27th Annual International ACM SIGIR Conference, pp. 250-257, Sheffield, UK, July 2004

  2. Evgeniy Gabrilovich and Shaul Markovitch
    "Text Categorization with Many Redundant Features: Using Aggressive Feature Selection to Make SVMs Competitive with C4.5"
    The 21st International Conference on Machine Learning (ICML), pp. 321-328, Banff, Alberta, Canada, July 2004

Additional publications

If you are using this test collection and want your article(s) listed here, please email me at gabr@cs.technion.ac.il.
  1. Your paper here ...

Evgeniy Gabrilovich
gabr@cs.technion.ac.il

Last updated on August 24, 2011