Resources for Text, Speech and Language Processing

Tagged datasets for named entity recognition tasks

  1. 1999 Information Extraction – Entity Recognition Evaluation
    Notes: This dataset is apparently in the public domain.
  2. MUC-3 and MUC-4 datasets
    Notes: These datasets are apparently in the public domain.
  3. Language-Independent Named Entity Recognition at CoNLL-2003
    Notes: This dataset is a manual annotation of a subset of RCV1 (Reuters Corpus Volume 1). The annotation itself is available free of charge (subject to a licensing agreement) from the CoNLL site. The raw text of the RCV1 documents must be requested from NIST (also free of charge and also subject to a licensing agreement).
  4. Message Understanding Conference (MUC) 6
    Notes: Consult the LDC Web site for current pricing and usage agreement.
  5. Message Understanding Conference (MUC) 6 Additional News Text
    Notes: Consult the LDC Web site for current pricing and usage agreement.
  6. Message Understanding Conference (MUC) 7
    Notes: Consult the LDC Web site for current pricing and usage agreement.
  7. ACE-2 Version 1.0
    Notes: Consult the LDC Web site for current pricing and usage agreement.
  8. TIDES Extraction (ACE) 2003 Multilingual Training Data
    Notes: Consult the LDC Web site for current pricing and usage agreement.
  9. ACE 2004 Multilingual Training Corpus
    Notes: Consult the LDC Web site for current pricing and usage agreement.
  10. Name-Annotated TDT Corpus Supplement for ACE
    Notes: Consult the LDC Web site for current pricing and usage agreement.
  11. Enron Email Dataset
    Notes: Email messages in this corpus are tagged with person names, dates and times.
  12. A variety of biomedical corpora
    Notes: Some corpora in this collection are tagged with entities in the biomedical domain, such as gene names.
  13. Automatic Content Extraction (ACE)
    Notes: Homepage of the ACE program.

Evgeniy Gabrilovich
gabr@cs.technion.ac.il

Last updated on July 28, 2006


Keywords: Computational Linguistics, Natural Language Processing, NLP, Natural Language Understanding, Natural Language Analysis, Natural Language Generation, Information Retrieval, IR, Artificial Intelligence, AI, Machine Learning, Corpus Linguistics, Algorithm Design, Text Mining, Text Data Mining, Named Entity Recognition, Disambiguation