Deep Learning Datasets

Reposted

mob60475704a236 2016-03-01 20:08:00

Datasets

These datasets can be used for benchmarking deep learning algorithms:

Symbolic Music Datasets

Natural Images

MNIST: handwritten digits
NIST: similar to MNIST, but larger
Perturbed NIST: a dataset developed in Yoshua’s class (NIST with tons of deformations)
CIFAR10 / CIFAR100: 32×32 natural image dataset with 10/100 categories
Caltech 101: pictures of objects belonging to 101 categories
Caltech 256: pictures of objects belonging to 256 categories
Caltech Silhouettes: 28×28 binary images containing silhouettes of objects from the Caltech 101 dataset
STL-10: an image recognition dataset for developing unsupervised feature learning, deep learning, and self-taught learning algorithms. It is inspired by the CIFAR-10 dataset, with some modifications.
The Street View House Numbers (SVHN) Dataset
NORB: binocular images of toy figurines under various illumination and pose
ImageNet: image database organized according to the WordNet hierarchy
Pascal VOC: various object recognition challenges
LabelMe: a large dataset of annotated images
COIL 20: different objects imaged at every angle in a 360° rotation
COIL100: different objects imaged at every angle in a 360° rotation

Artificial Datasets

Arcade Universe: an artificial dataset generator that produces images containing arcade game sprites such as Tetris pentomino/tetromino objects. The generator is based on O. Breleux’s bugland dataset generator.
A collection of datasets inspired by the ideas from BabyAISchool:
- distinguishing between 3 simple shapes
- a question-image-answer dataset
Datasets generated for the purpose of an empirical evaluation of deep architectures:
- introducing controlled variations in MNIST
- discriminating between wide and tall rectangles
- discriminating between convex and nonconvex shapes
- controlling the degree of correlation in noisy MNIST backgrounds
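
To make the "wide versus tall rectangles" task concrete, here is a minimal sketch of how such data could be generated. It is not the generator used for the datasets listed above: the function names, the 28×28 image size, the outline-only drawing, and the label encoding (1 for wide, 0 for tall) are all assumptions chosen for illustration.

```python
import random


def make_rectangle(size=28, rng=random):
    """Return (image, label) for one hypothetical sample.

    image is a size x size grid of 0/1 pixels containing a rectangle outline;
    label is 1 if the rectangle is wider than it is tall, else 0.
    """
    # Draw a width and height that differ, so the label is never ambiguous.
    while True:
        w = rng.randint(2, size)
        h = rng.randint(2, size)
        if w != h:
            break

    # Pick a top-left corner so the rectangle fits inside the image.
    left = rng.randint(0, size - w)
    top = rng.randint(0, size - h)

    image = [[0] * size for _ in range(size)]
    for r in range(top, top + h):
        for c in range(left, left + w):
            on_border = r in (top, top + h - 1) or c in (left, left + w - 1)
            if on_border:
                image[r][c] = 1

    label = 1 if w > h else 0
    return image, label


def make_dataset(n, size=28, seed=0):
    """Generate n (image, label) pairs reproducibly from a seed."""
    rng = random.Random(seed)
    return [make_rectangle(size, rng) for _ in range(n)]
```

Calling `make_dataset(1000)` would give a small labeled set for a binary classifier. Controlled variations such as noise or textured backgrounds could be layered on top of the same sketch.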

Faces

Speech

TIMIT Speech Corpus: phoneme classification
Aurora: TIMIT with noise and additional information

Recommendation Systems

MovieLens: Two datasets available from http://www.grouplens.org. The first dataset has 100,000 ratings for 1682 movies by 943 users, subdivided into five disjoint subsets. The second dataset has about 1 million ratings for 3900 movies by 6040 users.
Jester: This dataset contains 4.1 million continuous ratings (-10.00 to +10.00) of 100 jokes from 73,421 users.
Netflix Prize: Netflix released an anonymised version of their movie rating dataset; it consists of 100 million ratings from 480,000 users, each of whom rated between 1 and all of the 17,770 movies.
Book-Crossing dataset: This dataset is from the Book-Crossing community, and contains 278,858 users providing 1,149,780 ratings about 271,379 books.
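
The counts quoted above give a quick way to compare how sparse each rating matrix is. The sketch below computes the fraction of user-item pairs that have a rating, using only the figures listed in this section. The `density` helper and the dictionary keys are names made up for illustration, and several counts are rounded in the source ("about 1 million", "4.1 million"), so the results are approximate.

```python
def density(ratings, users, items):
    """Fraction of all possible user-item pairs that carry a rating."""
    return ratings / (users * items)


# (ratings, users, items) as quoted in the list above; some values are rounded.
DATASETS = {
    "MovieLens (100K)": (100_000, 943, 1_682),
    "MovieLens (1M)": (1_000_000, 6_040, 3_900),
    "Jester": (4_100_000, 73_421, 100),
    "Netflix Prize": (100_000_000, 480_000, 17_770),
    "Book-Crossing": (1_149_780, 278_858, 271_379),
}

for name, (ratings, users, items) in DATASETS.items():
    print(f"{name}: {density(ratings, users, items):.4%} of user-item pairs rated")
```

By these figures Jester is by far the densest (more than half of all pairs rated), the smaller MovieLens set sits around 6%, and Book-Crossing is well under 0.01%. That matters when choosing between methods that assume dense versus very sparse feedback.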