
Classic Open-Source Datasets for Machine Learning

2018-06-13

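0x00 Preface

Data is king: with the same machine learning algorithm, data of different quality produces models of different quality. This article introduces several classic open-source datasets from the data science field.

The article has three parts:

A detailed introduction to the most commonly used classic datasets
How to inspect a dataset cleanly with Python
Where to find other open-source datasets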

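0x01 Classic Datasets

I. Overview

The datasets below are some of the most commonly used ones I have collected. Between them they cover most of the machine learning workflow, and they appear frequently in the official examples of sklearn, Spark ML, and TensorFlow.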

II. Iris

This is perhaps the best known database to be found in the pattern recognition literature. Fisher’s paper is a classic in the field and is referenced frequently to this day. (See Duda & Hart, for example.) The data set contains 3 classes of 50 instances each, where each class refers to a type of iris plant. One class is linearly separable from the other 2; the latter are NOT linearly separable from each other.

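Iris, also known as the iris flower dataset, is a multivariate dataset. It was introduced by the eminent statistician R. A. Fisher in the mid-1930s and is widely regarded as the best-known dataset in data mining. It covers three species (Iris setosa, Iris versicolor, and Iris virginica) with 50 samples each. Each sample has four attributes, all measured in cm: sepal length, sepal width, petal length, and petal width.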
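The claim that one class is linearly separable from the other two can be checked directly. The following is a minimal sketch, assuming the copy of the data that ships with scikit-learn (`load_iris`); the 2.5 cm threshold on petal length is an illustrative choice made for this sketch, not something taken from the dataset description.

```python
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target   # X: measurements in cm, y: class ids 0..2

# Column order in scikit-learn's copy:
# sepal length, sepal width, petal length, petal width
petal_length = X[:, 2]

# Class 0 is Iris setosa in scikit-learn's copy of the data.
setosa = petal_length[y == 0]
others = petal_length[y != 0]

# Assumed threshold (2.5 cm): if every setosa sample falls below it and
# every other sample above it, one feature alone separates setosa.
threshold = 2.5
separable = bool(setosa.max() < threshold < others.min())

print(iris.target_names[0], "max petal length:", setosa.max())
print("other species min petal length:", others.min())
print("separable by a single threshold:", separable)
```

Running the same check on versicolor versus virginica would show overlapping ranges, which matches the statement that those two classes are not linearly separable from each other on a single feature.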

III. Adult

Extraction was done by Barry Becker from the 1994 Census database. A set of reasonably clean records was extracted using the following conditions: ((AAGE>16) && (AGI>100) && (AFNLWGT>1)&& (HRSWK>0)) Prediction task is to determine whether a person makes over 50K a year.

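This dataset was extracted from the 1994 US Census database and can be used to predict whether a resident's income exceeds $50K per year. The class variable is whether annual income exceeds $50K; the attributes include age, work class, education, occupation, race, and other important information. Notably, 8 of the 14 attributes are categorical.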
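Because the attributes mix numeric and categorical columns, a typical first step is to split them by type and encode the categorical ones. The sketch below uses three made-up rows with a handful of column names borrowed from the UCI description; the values are illustrative only and are not records from the dataset.

```python
import pandas as pd

# Hypothetical rows shaped like the Adult data (illustrative values only).
rows = pd.DataFrame({
    "age": [39, 50, 28],
    "workclass": ["State-gov", "Self-emp-not-inc", "Private"],
    "education": ["Bachelors", "Bachelors", "Masters"],
    "hours-per-week": [40, 13, 40],
    "income": ["<=50K", "<=50K", ">50K"],
})

features = rows.drop(columns="income")

# Numeric columns are detected by dtype; everything else is treated as categorical.
numeric_cols = features.select_dtypes(include="number").columns.tolist()
categorical_cols = [c for c in features.columns if c not in numeric_cols]

# One-hot encode the categorical columns; numeric columns pass through unchanged.
encoded = pd.get_dummies(features, columns=categorical_cols)

# Binary target: 1 if income is above 50K, else 0.
target = (rows["income"] == ">50K").astype(int)

print("numeric:", numeric_cols)
print("categorical:", categorical_cols)
print(encoded.columns.tolist())
```

One-hot encoding is only one option; tree-based models often work with ordinal codes instead.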

IV. Wine

These data are the results of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. The analysis determined the quantities of 13 constituents found in each of the three types of wines. I think that the initial data set had around 30 variables, but for some reason I only have the 13 dimensional version. I had a list of what the 30 or so variables were, but a.) I lost it, and b.), I would not know which 13 variables are included in the set.

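This dataset contains 178 records of wines from three different cultivars. The 13 attributes are 13 chemical constituents of the wine, and the cultivar can be inferred from the chemical analysis. Notably, all attributes are continuous.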
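These figures are easy to confirm from the data itself. A minimal sketch, assuming the copy bundled with scikit-learn (`load_wine`) and a scikit-learn version recent enough to support `as_frame=True`:

```python
import pandas as pd
from sklearn.datasets import load_wine

wine = load_wine(as_frame=True)   # as_frame=True returns pandas objects
X, y = wine.data, wine.target

n_records, n_attributes = X.shape
n_cultivars = y.nunique()

# Check that every attribute column holds numeric values.
all_numeric = all(pd.api.types.is_numeric_dtype(dtype) for dtype in X.dtypes)

print("records:", n_records, "attributes:", n_attributes)
print("cultivars:", n_cultivars)
print("samples per cultivar:")
print(y.value_counts().sort_index())
print("all attributes numeric:", all_numeric)
```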

V. 20 Newsgroups

The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups.

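The dataset contains roughly 20,000 newsgroup documents spread nearly evenly across 20 newsgroups. It is a classic, widely used dataset for experiments in text applications of machine learning, such as text classification and text clustering.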

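VI. MovieLens

MovieLens is a dataset of movie ratings collected by GroupLens from users of the MovieLens website; newer releases also link each movie to its IMDb and The Movie Database (TMDb) identifiers. The dataset can be used to build recommender systems.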

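VII. MNIST

MNIST is a dataset for handwritten digit recognition. It contains 60,000 training examples and 10,000 test examples, and each image is 28*28 pixels. The digits have been size-normalized and placed in fixed-size images, so most preprocessing is already done. Many mainstream machine learning tools (including sklearn) use it in their introductory tutorials and examples.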

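0x02 Data Exploration

The best way to understand a dataset in detail is not to read its documentation but to look at its distribution and characteristics yourself.

Understanding the Data

Here we take the Iris dataset as an example and describe it with Python's pandas. To obtain the data we use the API provided by sklearn directly rather than downloading it ourselves.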

1. Loading and describing the data
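A minimal sketch of this step, assuming scikit-learn's `load_iris` as the data source and a pandas DataFrame as the container; the `species` column name is a choice made for this sketch.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()

# Build a DataFrame from the measurement matrix, using scikit-learn's feature names.
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# Attach the class label as a readable categorical column
# ("species" is a column name chosen for this sketch).
df["species"] = pd.Categorical.from_codes(iris.target, iris.target_names)

print(df.head())                      # first few rows
print(df.describe())                  # count, mean, std, min, quartiles, max per numeric column
print(df["species"].value_counts())   # number of samples per class
```

`describe()` summarizes only the numeric columns by default, so the class balance is checked separately with `value_counts()`.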