Overcoming the Brittleness Bottleneck using Wikipedia: Enhancing Text Categorization with Encyclopedic Knowledge

This is an online appendix for the paper "Overcoming the Brittleness Bottleneck using Wikipedia: Enhancing Text Categorization with Encyclopedic Knowledge" by Evgeniy Gabrilovich and Shaul Markovitch, Twenty-First National Conference on Artificial Intelligence (AAAI), Boston, MA, July 2006.


Here we provide the details of the datasets that have been omitted from the paper owing to lack of space.

Dataset | Categories comprising the dataset
Topic-16 (RCV1) | e142, gobit, e132, c313, e121, godd, ghea, e13, c183, m143, gspo, c13, e21, gpol, m14, c15
Topic-10A (RCV1) | e31, c41, c151, c313, c31, m13, ecat, c14, c331, c33
Topic-10B (RCV1) | m132, c173, g157, gwea, grel, c152, e311, c21, e211, c16
Topic-10C (RCV1) | c34, c13, gtour, c311, g155, gdef, e21, genv, e131, c17
Topic-10D (RCV1) | c23, c411, e13, gdis, c12, c181, gpro, c15, g15, c22
Topic-10E (RCV1) | c172, e513, e12, ghea, c183, gdip, m143, gcrim, e11, gvio
Industry-16 (RCV1) | i81402, i79020, i75000, i25700, i83100, i16100, i1300003, i14000, i3302021, i8150206, i0100132, i65600, i3302003, i8150103, i3640010, i9741102
Industry-10A (RCV1) | i47500, i5010022, i3302021, i46000, i42400, i45100, i32000, i81401, i24200, i77002
Industry-10B (RCV1) | i25670, i61000, i81403, i34350, i1610109, i65600, i3302020, i25700, i47510, i9741110
Industry-10C (RCV1) | i25800, i41100, i42800, i16000, i24800, i02000, i34430, i36101, i24300, i83100
Industry-10D (RCV1) | i1610107, i97400, i64800, i0100223, i48300, i81502, i34400, i82000, i42700, i81402
Industry-10E (RCV1) | i33020, i82003, i34100, i66500, i1300014, i34531, i16100, i22450, i22100, i42900
OHSUMED-10A | B-Lymphocytes (D001402); Metabolism, Inborn Errors (D008661); Creatinine (D003404); Hypersensitivity (D006967); Bone Diseases, Metabolic (D001851); Fungi (D005658); New England (D009511); Biliary Tract (D001659); Forecasting (D005544); Radiation (D011827)
OHSUMED-10B | Thymus Gland (D013950); Insurance (D007341); Historical Geographic Locations (D017516); Leukocytes (D007962); Hemodynamics (D006439); Depression (D003863); Clinical Competence (D002983); Anti-Inflammatory Agents, Non-Steroidal (D000894); Cytophotometry (D003592); Hydroxy Acids (D006880)
OHSUMED-10C | Endothelium, Vascular (D004730); Contraceptives, Oral, Hormonal (D003278); Acquired Immunodeficiency Syndrome (D000163); Gram-Positive Bacteria (D006094); Diarrhea (D003967); Embolism and Thrombosis (D016769); Health Behavior (D015438); Molecular Probes (D015335); Bone Diseases, Developmental (D001848); Referral and Consultation (D012017)
OHSUMED-10D | Antineoplastic and Immunosuppressive Agents (D000973); Receptors, Antigen, T-Cell (D011948); Government (D006076); Arthritis, Rheumatoid (D001172); Animal Structures (D000825); Bandages (D001458); Italy (D007558); Investigative Techniques (D008919); Physical Sciences (D010811); Anthropology (D000883)
OHSUMED-10E | HTLV-BLV Infections (D006800); Hemoglobinopathies (D006453); Vulvar Diseases (D014845); Polycyclic Hydrocarbons, Aromatic (D011084); Age Factors (D000367); Philosophy, Medical (D010686); Antigens, CD4 (D015704); Computing Methodologies (D003205); Islets of Langerhans (D007515); Regeneration (D012038)
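
For readers who wish to rebuild a subset from the category lists above, the following is a minimal Python sketch, not the procedure used in the paper. It assumes that documents are available as (document id, list of category codes) pairs and that a document belongs to a dataset if it carries at least one of the listed categories; both are assumptions made for illustration. The names DATASETS and restrict_to_dataset are hypothetical. Codes are compared case-insensitively, on the assumption that a corpus distribution may spell them in a different case than this page does.

```python
# Hypothetical sketch: the data layout and the membership rule are assumptions,
# not taken from the paper. Only two of the datasets listed above are shown.
DATASETS = {
    "Topic-10A": ["e31", "c41", "c151", "c313", "c31",
                  "m13", "ecat", "c14", "c331", "c33"],
    "OHSUMED-10A": ["D001402", "D008661", "D003404", "D006967", "D001851",
                    "D005658", "D009511", "D001659", "D005544", "D011827"],
}


def restrict_to_dataset(documents, categories):
    """Keep documents that carry at least one of the given categories.

    documents  -- iterable of (doc_id, labels) pairs (assumed format)
    categories -- category codes defining the dataset
    Returns a list of (doc_id, kept_labels) pairs, where kept_labels holds
    only the lower-cased labels that belong to the dataset.
    """
    wanted = {code.lower() for code in categories}
    subset = []
    for doc_id, labels in documents:
        kept = [label.lower() for label in labels if label.lower() in wanted]
        if kept:
            subset.append((doc_id, kept))
    return subset
```

Calling restrict_to_dataset(docs, DATASETS["Topic-10A"]) on such a list of pairs returns only the documents labeled with at least one Topic-10A category, with all other labels dropped.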


Evgeniy Gabrilovich
gabr@cs.technion.ac.il

Last updated on April 21, 2006