Datasets


These datasets can be used for benchmarking deep learning algorithms:

Symbolic Music Datasets




  • Piano-midi.de: classical piano pieces
  • Nottingham: over 1000 folk tunes
  • MuseData: electronic library of classical music scores
  • JSB Chorales: set of four-part harmonized chorales

Natural Images



  • MNIST: handwritten digits
  • NIST: similar to MNIST, but larger
  • Perturbed NIST: a dataset developed in Yoshua’s class (NIST with tons of deformations)
  • CIFAR10 / CIFAR100: 32×32 natural image dataset with 10/100 categories
  • Caltech 101: pictures of objects belonging to 101 categories
  • Caltech 256: pictures of objects belonging to 256 categories
  • Caltech Silhouettes: 28×28 binary images containing silhouettes of objects from the Caltech 101 dataset
  • STL-10: an image recognition dataset for developing unsupervised feature learning, deep learning, and self-taught learning algorithms. It is inspired by the CIFAR-10 dataset, with some modifications.
  • The Street View House Numbers (SVHN) Dataset
  • NORB: binocular images of toy figurines under various illumination and pose
  • ImageNet: image database organized according to the WordNet hierarchy
  • Pascal VOC: various object recognition challenges
  • LabelMe: a large dataset of annotated images
  • COIL 20: different objects imaged at every angle in a 360-degree rotation
  • COIL 100: different objects imaged at every angle in a 360-degree rotation

Artificial Datasets



  • Arcade Universe: an artificial dataset generator with images containing arcade game sprites such as Tetris pentomino/tetromino objects. This generator is based on O. Breleux’s bugland dataset generator.
  • A collection of datasets inspired by the ideas from BabyAISchool:

    • distinguishing between 3 simple shapes
    • a question-image-answer dataset

  • Datasets generated for the purpose of an empirical evaluation of deep architectures:

    • introducing controlled variations in MNIST
    • discriminating between wide and tall rectangles
    • discriminating between convex and nonconvex shapes
    • controlling the degree of correlation in noisy MNIST backgrounds



Faces









Speech

  • TIMIT Speech Corpus: phoneme classification
  • Aurora: TIMIT with noise and additional information

Recommendation Systems



  • MovieLens: Two datasets available from http://www.grouplens.org. The first dataset has 100,000 ratings for 1682 movies by 943 users, subdivided into five disjoint subsets. The second dataset has about 1 million ratings for 3900 movies by 6040 users.
  • Jester: This dataset contains 4.1 million continuous ratings (-10.00 to +10.00) of 100 jokes from 73,421 users.
  • Netflix Prize: Netflix released an anonymised version of their movie rating dataset; it consists of 100 million ratings from 480,000 users, each of whom rated between 1 and all of the 17,770 movies.
  • Book-Crossing dataset: This dataset is from the Book-Crossing community, and contains 1,149,780 ratings of 271,379 books from 278,858 users.
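
The counts quoted above give a rough sense of how sparse each rating matrix is. The following is a minimal sketch, not part of any dataset's tooling: the dictionary keys are just labels chosen here, the figures are copied from the list (the second MovieLens count is the approximate "about 1 million"), and density is taken to mean ratings divided by users times items.

```python
# (ratings, users, items) as quoted in the list above.
# The keys are illustrative labels, not official dataset names.
DATASETS = {
    "MovieLens (first)": (100_000, 943, 1_682),
    "MovieLens (second)": (1_000_000, 6_040, 3_900),  # "about 1 million" ratings
    "Jester": (4_100_000, 73_421, 100),
    "Netflix Prize": (100_000_000, 480_000, 17_770),
    "Book-Crossing": (1_149_780, 278_858, 271_379),
}


def density(num_ratings, num_users, num_items):
    """Fraction of the user-by-item matrix that holds an observed rating."""
    return num_ratings / (num_users * num_items)


def density_table(datasets=DATASETS):
    """Map each dataset label to its rating-matrix density."""
    return {name: density(*stats) for name, stats in datasets.items()}


if __name__ == "__main__":
    for name, value in density_table().items():
        print(f"{name:20s} density = {value:.6f}")
```

Under these figures Jester is by far the densest matrix (more than half of all user-joke pairs are rated), while Book-Crossing is the sparsest, which is worth keeping in mind when comparing results across these benchmarks.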