Datasets
Links
WordSimilarity-353 Test Collection
http://www.cs.technion.ac.il/~gabr/resources/data/wordsim353/wordsim353.html
Contains 353 English word pairs along with human-assigned similarity judgements.
RISE: Repository of Information Sources used in information Extraction tasks
http://www.isi.edu/info-agents/RISE/
A repository of online information sources that serve as test domains for information extraction and wrapper generation tools, i.e. tools that learn extraction rules (extraction patterns).
Reuters-21578 Text Categorization Corpus
http://www.daviddlewis.com/resources/testcollections/reuters21578/
A classic benchmark for text categorization algorithms, consisting of 21,578 Reuters newswire stories from 1987.
DNA microarray gene expression data
http://www.ebi.ac.uk/~brazma/Data-mining/microarray.html
A collection of public gene expression data sources maintained by A. Brazma.


