Datasets
Links
AdEater data
http://www.cs.ucd.ie/staff/nick/home/research/ae/
AdEater is a program that learns to remove Internet advertisements. The machine learning dataset is available from this page.
Dataset generator
Datgen, formerly SCDS, is a computer program that generates data to systematically test programs that consume data. These synthetic datasets can be used to validate learning algorithms.
DELVE - Data for Evaluating Learning in Valid Experiments
http://www.cs.utoronto.ca/~delve/
Data for Evaluating Learning in Valid Experiments: a standardized environment for evaluating the performance of methods that learn relationships based primarily on empirical data. Delve lets users compare their learning methods with other methods on many datasets.
DNA microarray gene expression data
http://www.ebi.ac.uk/~brazma/Data-mining/microarray.html
A collection of public gene expression data sources maintained by A. Brazma.
Face recognition dataset
http://www.cs.cmu.edu/afs/cs.cmu.edu/user/avrim/www/ML94/face_homework.html
A dataset of face images for face recognition algorithms.
HS3D - Homo Sapiens Splice Sites Dataset
http://www.sci.unisannio.it/docenti/rampone/
HS3D (Homo Sapiens Splice Sites Dataset) is a database of Homo sapiens exon, intron, and splice regions extracted from GenBank primate sequences, Rel. 123. It aims to provide standardized material for training computational approaches to gene identification and characterization and for assessing their prediction accuracy.
Learning Relational Concepts from Sensor Data of a Mobile Robot
http://www-ai.cs.uni-dortmund.de/FORSCHUNG/PROJEKTE/BLEARN2/data-sets.html
A collection of data sets, each represented in first-order logic. Maintained at the University of Dortmund, Germany.
National Space Science Data Center
Provides access to a wide variety of astrophysics, space physics, solar physics, lunar and planetary data from NASA space flight missions, in addition to selected other data and some models and software.
NIST Special Database 4
http://www.nist.gov/srd/nistsd4.htm
This NIST database of fingerprint images contains 2000 8-bit grayscale fingerprint image pairs.
Penn Treebank Project
http://www.cis.upenn.edu/~treebank/
A corpus of parsed sentences. Used by many researchers for training data-driven parsing algorithms.

