Machine Learning Algorithms
上QQ阅读APP看书,第一时间看更新

scikit-learn toy datasets

scikit-learn provides some built-in datasets that can be used for testing purposes. They're all available in the package sklearn.datasets and have a common structure: the data instance variable contains the whole input set X while target contains the labels for classification or target values for regression. For example, considering the Boston house pricing dataset (used for regression), we have:

from sklearn.datasets import load_boston

>>> boston = load_boston()
>>> X = boston.data
>>> Y = boston.target

>>> X.shape
(506, 13)
>>> Y.shape
(506,)

In this case, we have 506 samples with 13 features and a single target value. In this book, we're going to use it for regressions and the MNIST handwritten digit dataset (load_digits()) for classification tasks. scikit-learn also provides functions for creating dummy datasets from scratch: make_classification(), make_regression(), and make_blobs() (particularly useful for testing cluster algorithms). They're very easy to use and in many cases, it's the best choice to test a model without loading more complex datasets.

Visit http://scikit-learn.org/stable/datasets/ for further information.
The MNIST dataset provided by scikit-learn is limited for obvious reasons. If you want to experiment with the original one, refer to the website managed by Y. LeCun, C. Cortes, C. Burges: http://yann.lecun.com/exdb/mnist/. Here you can download a full version made up of 70,000 handwritten digits already split into training and test sets.