The WordSimilarity-353 Test Collection

Version: 1.0
Release date: February 10, 2002
Maintained by: Evgeniy Gabrilovich (gabr@cs.technion.ac.il)

Overview

The WordSimilarity-353 Test Collection contains two sets of English word pairs along with human-assigned similarity judgements. The collection can be used to train and/or test computer algorithms implementing semantic similarity measures (i.e., algorithms that numerically estimate similarity of natural language words).
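The collection does not prescribe an evaluation procedure. One common choice, assumed here and not stated in this document, is to compare an algorithm's scores with the human mean scores using Spearman rank correlation. The sketch below implements that in plain Python; the function names and the sample numbers are hypothetical and are not taken from the data set.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1  # average of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation; assumes neither list is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Made-up numbers for illustration only.
human_means = [7.5, 2.0, 9.0]
model_scores = [0.6, 0.1, 0.9]
print(spearman(human_means, model_scores))
```

A rank-based measure is used in this sketch because an algorithm's scores need not be on the 0 to 10 scale used by the subjects.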

Description

The first set (set1) contains 153 word pairs along with their similarity scores assigned by 13 subjects. The second set (set2) contains 200 word pairs, with their similarity assessed by 16 subjects. Subjects' names have been replaced by ordinal numbers (1..13, or 1..16) to protect their privacy; identical numbers in the two sets do not necessarily correspond to the same individual.

All the subjects in both experiments possessed near-native command of English. Their instructions were to estimate the relatedness of the words in pairs on a scale from 0 (totally unrelated words) to 10 (very much related or identical words). The precise instructions are available in file instructions.txt inside the ZIP archive (see section "Availability and usage" below).

Each set provides the raw scores assigned by each subject, as well as the mean score for each word pair. For convenience, a combined set (combined) is provided that contains a list of all 353 word pairs, along with their mean similarity scores. The combined set is merely a concatenation of the two smaller sets.
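The relationship between the raw scores, the per-pair means, and the combined set can be sketched as follows. This is only an illustration: the rows are made up, the helper name `with_means` is hypothetical, and whether the distributed files round the means is not stated here.

```python
from statistics import fmean

# Hypothetical rows: (word1, word2, raw scores, one per subject).
set1_raw = [("alpha", "beta", [7, 8, 7.5])]
set2_raw = [("gamma", "delta", [1, 3, 2, 2])]


def with_means(rows):
    """Replace each pair's raw scores with their arithmetic mean."""
    return [(w1, w2, fmean(scores)) for w1, w2, scores in rows]


# The combined set is the two smaller sets placed one after the other.
combined = with_means(set1_raw) + with_means(set2_raw)
print(combined)
```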

All sets (set1, set2 and combined) are available in two formats:

Comma-separated values (CSV) - see files with the csv extension
Tab-delimited (TAB) - see files with the tab extension

The first two columns in each file contain word pairs, followed by a column with the (floating-point) mean score of the subjects' individual assessments. In set1 and set2 there are additional columns with individual subjects' scores (one column per subject). In the general case, all scores are floating-point, although many appear as integers.
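The column layout described above can be read with Python's standard csv module. The sketch below is a minimal reader under two assumptions that this document does not confirm: that a file may start with a header row (detected here by non-numeric score cells and skipped), and that the sample rows, which are made up, resemble the real ones. The function name `read_pairs` is hypothetical.

```python
import csv
import io


def read_pairs(handle, delimiter=","):
    """Yield (word1, word2, mean_score, subject_scores) for each data row."""
    for row in csv.reader(handle, delimiter=delimiter):
        if len(row) < 3:
            continue  # blank or malformed line
        try:
            scores = [float(cell) for cell in row[2:]]
        except ValueError:
            continue  # assumed header row: score cells are not numeric
        # Column 3 is the mean; any further columns are per-subject scores.
        yield row[0], row[1], scores[0], scores[1:]


# Made-up rows for illustration only; not taken from the data set.
sample = io.StringIO(
    "Word 1,Word 2,Mean,1,2,3\n"
    "alpha,beta,7.5,7,8,7.5\n"
    "gamma,delta,2,1,3,2\n"
)
pairs = list(read_pairs(sample))
print(pairs)
```

For the tab-delimited files, the same function can be called with `delimiter="\t"`. For the combined set, which has no per-subject columns, the last element of each tuple is an empty list.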

Note: set1 includes, among others, all the 30 noun pairs from G.A. Miller and W.G. Charles, "Contextual correlates of semantic similarity", Language and Cognitive Processes, Vol. 6, No. 1, 1991, pp. 1-28 (although similarity scores have been obtained anew).

Availability and usage

Download the data set as a ZIP file:

wordsim353.zip (23 KB; 53 KB uncompressed).

If you publish results based on this data set, please cite as

Lev Finkelstein, Evgeniy Gabrilovich, Yossi Matias, Ehud Rivlin, Zach Solan, Gadi Wolfman, and Eytan Ruppin, "Placing Search in Context: The Concept Revisited", ACM Transactions on Information Systems, 20(1):116-131, January 2002.

Please also inform your readers of the current location of the data set:
http://www.cs.technion.ac.il/~gabr/resources/data/wordsim353/wordsim353.html

Additional distributions

Eneko Agirre et al. proposed splitting the WordSimilarity-353 collection into two datasets, one focused on measuring similarity and the other on relatedness. The data is available at http://alfonseca.org/eng/research/wordsim353.html

Questions?

If you have questions or comments, please email me at gabr@cs.technion.ac.il.

References

Lev Finkelstein, Evgeniy Gabrilovich, Yossi Matias, Ehud Rivlin, Zach Solan, Gadi Wolfman, and Eytan Ruppin
"Placing Search in Context: The Concept Revisited"
ACM Transactions on Information Systems, 20(1):116-131, January 2002

Lev Finkelstein, Evgeniy Gabrilovich, Yossi Matias, Ehud Rivlin, Zach Solan, Gadi Wolfman, and Eytan Ruppin
"Placing Search in Context: The Concept Revisited"
The Tenth International World Wide Web Conference (WWW10), pp. 406-414, Hong Kong, May 2001, ACM Press

Additional publications

If you are using the WordSimilarity-353 Test Collection and want your article(s) listed here, please email me at gabr@cs.technion.ac.il.

Evgeniy Gabrilovich and Shaul Markovitch
"Computing Semantic Relatedness using Wikipedia-based Explicit Semantic Analysis"
Proceedings of The 20th International Joint Conference on Artificial Intelligence (IJCAI), Hyderabad, India, January 2007

Michael Strube and Simone Paolo Ponzetto
"WikiRelate! Computing Semantic Relatedness Using Wikipedia"
Proceedings of The 21st National Conference on Artificial Intelligence (AAAI), Boston, MA, July 2006

Mario Jarmasz
"Roget's Thesaurus as a Lexical Resource for Natural Language Processing"
M.Sc. Thesis, School of Information Technology and Engineering, University of Ottawa, Canada, July 2003

Douglas L.T. Rohde, Laura M. Gonnerman, and David C. Plaut
"An Improved Method for Deriving Word Meaning from Lexical Co-Occurrence"
In preparation

James Richard Curran
"From Distributional to Semantic Similarity"
Ph.D. Thesis, Institute for Communicating and Collaborative Systems, School of Informatics, University of Edinburgh, 2003

Eneko Agirre, Enrique Alfonseca, Keith Hall, Jana Kravalova, Marius Pasca, and Aitor Soroa
"A Study on Similarity and Relatedness Using Distributional and WordNet-based Approaches"
Proceedings of NAACL-HLT 2009

Your paper here ...

Other word similarity resources

Latent Semantic Analysis (LSA) [aka Latent Semantic Indexing (LSI)]
Hyperspace Analog to Language (HAL)
- Psycholinguistics and Computational Cognition Lab
WordNet
Dekang Lin's semantic metrics: dependency-based and proximity-based word similarity
- Demos
- Downloads
Peter Turney's unsupervised learning algorithm for recognizing synonyms
- Mining the Web for Synonyms: PMI-IR versus LSA on TOEFL
- Online demo

Evgeniy Gabrilovich
gabr@cs.technion.ac.il

Last updated on October 4, 2006