These datasets can be used for benchmarking deep learning algorithms:
Symbolic Music Datasets
- Piano-midi.de: classical piano pieces
- Nottingham: over 1000 folk tunes
- MuseData: electronic library of classical music scores
- JSB Chorales: set of four-part harmonized chorales
Natural Images
- MNIST: handwritten digits
- NIST: similar to MNIST, but larger
- Perturbed NIST: a dataset developed in Yoshua’s class (NIST with tons of deformations)
- CIFAR10 / CIFAR100: 32×32 natural image dataset with 10/100 categories
- Caltech 101: pictures of objects belonging to 101 categories
- Caltech 256: pictures of objects belonging to 256 categories
- Caltech Silhouettes: 28×28 binary images containing silhouettes of objects from the Caltech 101 dataset
- STL-10: an image recognition dataset for developing unsupervised feature learning, deep learning, and self-taught learning algorithms. It is inspired by the CIFAR-10 dataset, with some modifications.
- The Street View House Numbers (SVHN) Dataset
- NORB: binocular images of toy figurines under various illumination and pose
- Imagenet: image database organized according to the WordNet hierarchy
- Pascal VOC: various object recognition challenges
- Labelme: a large dataset of annotated images
- COIL 20: different objects imaged at every angle in a 360° rotation
- COIL 100: different objects imaged at every angle in a 360° rotation
Artificial Datasets
- Arcade Universe: an artificial dataset generator with images containing arcade game sprites such as Tetris pentomino/tetromino objects. This generator is based on O. Breleux’s bugland dataset generator.
- A collection of datasets inspired by ideas from BabyAISchool:
  - distinguishing between 3 simple shapes
  - a question-image-answer dataset
- Datasets generated for an empirical evaluation of deep architectures:
  - introducing controlled variations in MNIST
  - discriminating between wide and tall rectangles
  - discriminating between convex and nonconvex shapes
  - controlling the degree of correlation in noisy MNIST backgrounds
Faces
Recommendation Systems
- MovieLens: Two datasets available from http://www.grouplens.org. The first dataset has 100,000 ratings for 1682 movies by 943 users, subdivided into five disjoint subsets. The second dataset has about 1 million ratings for 3900 movies by 6040 users.
- Jester: This dataset contains 4.1 million continuous ratings (-10.00 to +10.00) of 100 jokes from 73,421 users.
- Netflix Prize: Netflix released an anonymised version of their movie rating dataset; it consists of 100 million ratings from 480,000 users, each of whom rated between 1 and all of the 17,770 movies.
- Book-Crossing dataset: This dataset is from the Book-Crossing community, and contains 1,149,780 ratings of 271,379 books provided by 278,858 users.
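
When choosing among the recommendation datasets, it can help to compare how densely each user-item rating matrix is filled. The following is a minimal sketch that uses only the counts quoted in the list above; the names `DATASETS`, `density`, and `density_table` are illustrative and not part of any dataset's tooling, and the MovieLens labels simply follow the order in which the two datasets are listed. Several of the quoted counts are rounded (for example "about 1 million"), so the results are approximate.

```python
# Counts as quoted in the list above: (ratings, users, items).
# Several figures are rounded in the source, so densities are approximate.
DATASETS = {
    "MovieLens (first)": (100_000, 943, 1_682),
    "MovieLens (second)": (1_000_000, 6_040, 3_900),
    "Jester": (4_100_000, 73_421, 100),
    "Netflix Prize": (100_000_000, 480_000, 17_770),
    "Book-Crossing": (1_149_780, 278_858, 271_379),
}


def density(ratings, users, items):
    """Return the fraction of the users x items matrix that holds a rating."""
    return ratings / (users * items)


def density_table(datasets=DATASETS):
    """Map each dataset name to its approximate rating-matrix density."""
    return {name: density(*counts) for name, counts in datasets.items()}


if __name__ == "__main__":
    # Print datasets from sparsest to densest.
    for name, d in sorted(density_table().items(), key=lambda kv: kv[1]):
        print(f"{name:20s} {d:.4%}")
```

On these figures, Jester is by far the densest matrix (more than half of all user-joke pairs are rated) and Book-Crossing the sparsest, which affects which collaborative filtering methods are practical on each.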