This is an online appendix for the paper "Feature Generation for Text Categorization Using World Knowledge" by Evgeniy Gabrilovich and Shaul Markovitch, Nineteenth International Joint Conference on Artificial Intelligence (IJCAI), Edinburgh, Scotland, UK, August 2005.
We enhance machine learning algorithms for text categorization with generated features based on domain-specific and common-sense knowledge. This knowledge is represented using publicly available ontologies that contain hundreds of thousands of concepts, such as the Open Directory; these ontologies are further enriched by several orders of magnitude through controlled Web crawling. Prior to text categorization, a feature generator analyzes the documents and maps them onto appropriate ontology concepts, which in turn induce a set of generated features that augment the standard bag of words. Feature generation is accomplished through contextual analysis of document text, implicitly performing word sense disambiguation. Coupled with the ability to generalize concepts using the ontology, this approach addresses the two main problems of natural language processing: synonymy and polysemy. Categorizing documents with the aid of knowledge-based features leverages information that cannot be deduced from the documents alone. Experimental results confirm improved performance, breaking through the plateau previously reached in the field.
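The general idea of augmenting a bag of words with ontology concepts can be illustrated with a minimal sketch. This is not the feature generator from the paper: the toy ontology, the word-overlap scoring, and the names `TOY_ONTOLOGY`, `best_concepts`, `ancestors`, and `generate_features` are all assumptions made for illustration. It only shows how the surrounding context can select one sense of an ambiguous word ("bank") and how the hierarchy contributes more general concepts as extra features.

```python
import re
from collections import Counter

# Hypothetical toy ontology: concept -> (parent concept, characteristic words).
# A real system would rely on a large hierarchy such as the Open Directory.
TOY_ONTOLOGY = {
    "Business": (None, {"market", "shares", "company"}),
    "Business/Banking": ("Business", {"bank", "loan", "interest", "deposit"}),
    "Science": (None, {"research", "study"}),
    "Science/Geology": ("Science", {"river", "bank", "erosion", "sediment"}),
}


def best_concepts(tokens, top_k=1):
    """Rank concepts by how many context tokens fall in their word sets."""
    scores = {
        concept: sum(1 for t in tokens if t in words)
        for concept, (_, words) in TOY_ONTOLOGY.items()
    }
    matched = [c for c, s in scores.items() if s > 0]
    # Highest score first; ties broken alphabetically for determinism.
    matched.sort(key=lambda c: (-scores[c], c))
    return matched[:top_k]


def ancestors(concept):
    """Return all more general concepts above `concept` in the hierarchy."""
    chain = []
    parent = TOY_ONTOLOGY[concept][0]
    while parent is not None:
        chain.append(parent)
        parent = TOY_ONTOLOGY[parent][0]
    return chain


def generate_features(text, top_k=1):
    """Bag of words plus the best-matching concepts and their ancestors."""
    tokens = re.findall(r"[a-z]+", text.lower())
    features = Counter(tokens)  # the standard bag of words
    for concept in best_concepts(tokens, top_k):
        features["CONCEPT=" + concept] += 1
        for general in ancestors(concept):  # generalization step
            features["CONCEPT=" + general] += 1
    return features


finance = generate_features("The bank raised the interest on every loan.")
geology = generate_features("Erosion along the river bank moved sediment.")
```

In this sketch both sentences contain the word "bank", yet the first receives the `Business/Banking` and `Business` features while the second receives `Science/Geology` and `Science`, because the remaining words in each sentence decide which concept scores highest.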
Here we provide dataset details that were omitted from the paper owing to lack of space.
Dataset | Categories comprising the dataset |
---|---|
Topic-16 (RCV1) | e142, gobit, e132, c313, e121, godd, ghea, e13, c183, m143, gspo, c13, e21, gpol, m14, c15 |
Topic-10A (RCV1) | e31, c41, c151, c313, c31, m13, ecat, c14, c331, c33 |
Topic-10B (RCV1) | m132, c173, g157, gwea, grel, c152, e311, c21, e211, c16 |
Topic-10C (RCV1) | c34, c13, gtour, c311, g155, gdef, e21, genv, e131, c17 |
Industry-16 (RCV1) | i81402, i79020, i75000, i25700, i83100, i16100, i1300003, i14000, i3302021, i8150206, i0100132, i65600, i3302003, i8150103, i3640010, i9741102 |
Industry-10A (RCV1) | i47500, i5010022, i3302021, i46000, i42400, i45100, i32000, i81401, i24200, i77002 |
Industry-10B (RCV1) | i25670, i61000, i81403, i34350, i1610109, i65600, i3302020, i25700, i47510, i9741110 |
Industry-10C (RCV1) | i25800, i41100, i42800, i16000, i24800, i02000, i34430, i36101, i24300, i83100 |
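For readers who want to work with these category sets programmatically, the Topic rows of the table can be encoded as a simple mapping. The sketch below is illustrative only: the names `TOPIC_DATASETS` and `select_documents`, the example document ids, and the rule that a document is kept when it carries at least one listed category are assumptions, not a description of how the datasets in the paper were built. Labels are lowercased so that they compare against the codes as written in the table.

```python
# Topic datasets copied from the table above (Industry rows omitted for brevity).
TOPIC_DATASETS = {
    "Topic-16": "e142 gobit e132 c313 e121 godd ghea e13 c183 m143 "
                "gspo c13 e21 gpol m14 c15".split(),
    "Topic-10A": "e31 c41 c151 c313 c31 m13 ecat c14 c331 c33".split(),
    "Topic-10B": "m132 c173 g157 gwea grel c152 e311 c21 e211 c16".split(),
    "Topic-10C": "c34 c13 gtour c311 g155 gdef e21 genv e131 c17".split(),
}


def select_documents(docs, dataset_name):
    """Keep documents that carry at least one category of the dataset.

    `docs` maps a document id to an iterable of category codes.
    Returns a dict: document id -> sorted list of the matching categories.
    This selection rule is an assumption made for illustration.
    """
    wanted = set(TOPIC_DATASETS[dataset_name])
    selected = {}
    for doc_id, labels in docs.items():
        kept = wanted.intersection(label.lower() for label in labels)
        if kept:
            selected[doc_id] = sorted(kept)
    return selected


# Hypothetical documents with made-up ids, used only to exercise the function.
example_docs = {
    "doc-a": ["GSPO", "GCAT"],
    "doc-b": ["C152", "CCAT"],
    "doc-c": ["E21", "C13"],
}
topic16_docs = select_documents(example_docs, "Topic-16")
```

With the made-up documents above, `doc-a` and `doc-c` are selected for Topic-16, while `doc-b` is not, since `c152` appears only in Topic-10B.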
Evgeniy Gabrilovich
gabr@cs.technion.ac.il
Last updated on April 4, 2005