Resources for Text, Speech and Language Processing

Pointers to Internet resources

Bibliographies

Bibliography of constructive induction - feature engineering
Bibliography on Automated Text Categorization
Bibliography - Text Categorization
Automatic Text Processing related short bibliography
Feature Subset Selection Bibliography
Bibliography of NLP in Biomedicine
Lifelong learning, meta-learning
Spam Bibliography
Machine Learning Bibliographies
Machine Learning Applied to Text
Feature Selection
Computer Science Bibliographies
TDT Publications
Bibliography on Transformation-Based Learning

Projects

Common Sense

Open Mind
OpenCyc
ThoughtTreasure home page
Cycorp

Companies and organizations

Electronic pocket talking dictionaries and translators
ARDA Home Page
Web Intelligence Consortium
AvaQuest, Inc. Resources - Categorization Vendors

Open source projects

Senga
The OpenNLP Homepage
Worldwide Lexicon
NLP Toolkit
POPFile Automatic Email Sorting using Naive Bayes
linguana
Morphix-NLP -- The most NLP applications on one CD!
ZSoft platform-independent solutions for Data Mining

Spam mail

Email spam
A Plan for Spam
POPFile Automatic Email Sorting using Naive Bayes
Spammunition
Internet Content Filtering Group

Machine Learning Laboratory
Snowfox Home
Research proposal
Welcome to Cross Language Evaluation Forum
Text REtrieval Conference (TREC)
WebBase Project
Data Mining on the Web (mentions OpenDir)
WebKB
search.cpan.org Ken Williams - AI-Categorizer
WebKB@CMU
Interspace
Center for Automated Learning and Discovery
Columbia Newsblaster
Google Web APIs - Home
The Lemur Toolkit for Language Modeling and Information Retrieval
UNLP General Information
Text categorization using lexical chains
Kernel Methods for Image and Text
Natural Language Processing (NLP) at Cornell
The CAPTCHA Project
Demo of semantic word orientation

Tools

SVM

LIBSVM
MATLAB Support Vector Machine Toolbox
SVM-Light Support Vector Machine
SvmFu Documentation
mySVM

Language Identification

Language Identification Tools
Stochastic Language Identifier
Language Identification
XRCE CA Language Identifier
Welcome to Inxight Software, Inc.
OEM Products Language & Character Encoding Identification
Automatic Language Identification Bibliography
RALI -- SILC
Identification of Language and Character Encoding
Basis Technology's Products Rosette Language Identifier
TextCat Language Guesser

Stemming

Porter in Perl
Lovins
Snowball
Porter Stemming Algorithm

Part of Speech Tagging

MULTEXT
TnT - Statistical Part-of-Speech Tagging
QTag
Eric Brill's tagger
ePost - C++ wrapper of Brill's tagger

Text categorization

The Bow Toolkit
UDC in brief
Kea - automatic keyphrase extraction
BoosTexter
SNoW
LTG software LT TCR
S-EM download page Learning with Positive and Unlabeled Data
LPU download page

Machine Learning

C4.5 - C5.0

See5 An Informal Tutorial
RuleQuest Research Data Mining Tools
Ross Quinlan - AI Group, CSE

Weka 3
The SLIPPER Rule Learning System
The WHIRL System
DTREG -- Decision Tree Analysis Program
NLREG -- Nonlinear Regression Analysis Program
SGI - MLC++ Home Page
YALE - Yet Another Learning Environment

WordNet

EuroWordNet
The Global WordNet Association
WordNet
WordNet 1.6 Vocabulary Helper
WordNet in RDF
Wordnet Domains
Richard Lexicon Home
Demos

Roget's Thesaurus

Roget's Thesaurus as an Electronic Lexical Knowledge Base

LSA and HAL

LSI - Latent Semantic Indexing Web Site
Psycholinguistics and Computational Cognition Lab
Telcordia Latent Semantic Indexing (LSI) Demo Machine
LSA @ CU Boulder
Introduction to LSI

Hubs

CMU AI Repository - NLP
NL Software Registry @ DFKI
Resources
Software Tools for NLP
Speech and Language Web Resources
The Data Warehousing Information Center - Text Mining Tools
Welcome to Cognitive Computation

Sentence boundary detection

SATZ - Sentence boundary detector
MXTERMINATOR
search.cpan.org Tony G. Rose - HTML-Summary-0.017
LTG software LT TTT
Adwait Ratnaparkhi Stat NLP
Automatic English Sentence Segmenter
Lingua::EN::Sentence - Splitting text into sentences.
Sentencizers

XML parsers

expat
Xerces C++ Parser

Open directory

Yahoo

About Yahoo

Open Directory - Use of ODP Data
Web Directory Sizes
ODP and Yahoo Size Projection Charts

Semantic metrics

Dekang Lin - semantic metrics
search.cpan.org Siddharth Patwardhan - WordNet-Similarity-0.03

NL parsing

Minipar
Link Grammar Parser
Apple Pie Parser
Conexor Analyzers

Misc text analysis tools

LT Group - Edinburgh
Infogistics Text Analysis tools
Senga
fnTBL Toolkit - Home
WordStat
SRI Language Modeling Toolkit
Textomy - tools for text dissection

Text summarization

Copernic Summarizer - Product Overview
search.cpan.org HTML::Summary - module for generating a summary from a web page.

HTML parsers

Clean up your Web pages with HTML TIDY
HTML Tidy Project Page

Named Entity Recognition

Language-Independent Named Entity Recognition

AI Search

Local++ Project Home Page
AI C++ Search Class Library

Math

Netlib
TNT Home Page
GAMS - Guide to Available Mathematical Software
Critical t Values
Peter Hellekalek pLab Software
Pseudo random number generators

C++

STL Guide at SGI
STLport
Boost
STL Error Decryptor

Scripting

Rob van der Woude's Scripting Pages Batch Files
Sample Win9x Batch Programs

GSview
Introduction to GnuPlot

Misc

Search engines

Notess.com: The Greg Notess Web Site
Search Engine Watch
Search Tools - Information, Guides and News
Finding Information on the Internet A TUTORIAL
Search tools
Web Search @ About.com
The Internet Archive Wayback Machine
Searchengines.Ru
Search Engine Showdown
Teoma Search -- Search with Authority
KartOO
On Search, the Series

Speech Processing

Speech Recognition Update
Speech Technology Magazine online
Speechtechnology Network
Compaq.com - SpeechBot
Biometric Consortium

Book publishers

MIT Press
Addison-Wesley
Prentice Hall
W.H.Freeman and Company
Cambridge University Press
Academic Press
Kluwer Academic Publishers
Oxford University Press
The University of Chicago Press
Elsevier
John Wiley and Sons
O'Reilly and Associates
McGraw-Hill Book Company
Macmillan Computer Reference

Mailing lists

TREC filtering
Corpora
Colibri
Elsnet list
Linguist
Search Engine Report
Connectionists
WebIR

Corpora and lexicons

Hubs

SIGLEX Resources
Corpus Linguistics
English language corpora
Linguistic Data Resources on the Internet
The ACL NLP-CL Universe
W3-Corpora List of Corpora
BNC English Language Corpora and Corpus resources
David Lee's Bookmarks for Corpus-based Linguists

Online books and texts

Project Gutenberg
Electronic Text Center -- University of Virginia
The Online Books Page

RCV1

Reuters Research and Standards Group - Corpus
RCV1

Reuters-21578

Reuters-21578 Text Categorization Collection
Reuters-21578 Text Categorization Test Collection
Tools for Reuters-21578 Text Categorization Dataset

OHSUMED

Files Available to Download or View
Medical Subject Headings (MeSH)
OHSUMED (FTP)

American National Corpus
Novelty and Redundancy Detection for Adaptive Filtering DataSet
Glasgow IDOM - Test collections
ICAME
The BNC Handbook
LDC - Linguistic Data Consortium
The ELRA home page
The Oxford Text Archive
WIPO automated categorization datasets
Web Term Document Frequency Form
OPUS - an open source parallel corpus
Collocational Dictionary (ARCS)
The Moby Project
The TREC-AP Text Categorization Test Collection
Words and Phrases from the British National Corpus
Free Association Norms
Longman Dictionaries for Research (LDOCE)
Movie Review Data

Scientific search

NCSTRL Home Page
Computer Science e-Print Archive
Cora Research Paper Search
IEEE Xplore
ResearchIndex (NEC)
Welcome to the ACM Digital Library
Welcome to IEEE Transactions & Journals
Scirus - Searching for Science
Unified Computer Science TR Index (UCSTRI)
search4science
Computation and Language - ISRAEL Mirror
Other Lists of Bibliographies
Computer Science Bibliography Glimpse Server
Cornell Computer Science Technical Reports
NASA Technical Report Server (NTRS)
Papers database main page
Technical Reports - NASA LaRC Technical Library

Online publications

Journals

Journal of Artificial Intelligence Research
Journal of Machine Learning Research
Journal of Intelligent Information Systems
TAL journal - Association pour le Traitement Automatique des LAngues

Conferences

VLDB Endowment Inc.

Books and reports

Foundations of Statistical Natural Language Processing
Survey of the State of the Art in Human Language Technology
Pattern Classification - Duda, Hart, Stork
Generalized Information Measures and their Applications
Managing Gigabytes
Numerical Recipes
Data-Intensive Linguistics

ACL Anthology

Hubs on NLP, IR, ML etc

ELSnet Homepage
fabulousness - linguistics and stuff
Information Retrieval Links
Fieldmethods.net
Linguistic Resources on the Internet
Speech and Language Web Resources
Boosting Research Site Boosting.org
Survey of Information Retrieval
The Association for Computational Linguistics
The LINGUIST List
COLT Computational Learning Theory
Pattern Recognition on the Web
Statistical NLP - corpus-based resources
The ELRA home page
KDnuggets Data Mining, Web Mining, and Knowledge Discovery Guide
Information Filtering Resources
MLnet OiS - Machine Learning, Knowledge Discovery, Data Mining, Case-based Reasoning, and Knowledge Acquisition
Glasgow IDOM - IR resources
Weblog of computational linguistics
WebIR
ACL SIG on Natural Language Learning (SIGNLL)
COLE sites about Computational Linguistics
EACL
HLT Home

LaTeX

Tutorials

Advanced LaTeX
LaTeX - from quick and dirty to style and finesse

Reference

LaTeX2e Help file
Help on LaTeX commands
The LaTeX Encyclopedia
Math Symbols in LaTeX
LaTeX maths and graphics
The Technion Guide to LaTeX2e

Usage

CTAN LaTeX Archive
The TeX Catalog Online
TeX Users Group Home Page


Evgeniy Gabrilovich
gabr@cs.technion.ac.il

Last updated on November 30, 2011