Hands-On Machine Learning with Scikit-Learn and TensorFlow (Chapter 3)

_Gus_

835 views · 2018-07-08 21:35:08

MNIST

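Load MNIST with scikit-learn's built-in loader.

The first import may fail. If it does, download the file from https://github.com/amplab/datascience-sp14/blob/master/lab7/mldata/mnist-original.mat and place it in the mldata folder.

# Note: fetch_mldata was deprecated in scikit-learn 0.20 and removed in 0.22.
from sklearn.datasets import fetch_mldata
# Raw string so the backslashes in the Windows path are not treated as escapes.
mnist = fetch_mldata('MNIST original', data_home=r'H:\paper\DeepLearning\Tensorflow\hand on with tensorflow')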

The result is shown below.

Extract the data and check its dimensions.

X, y = mnist["data"], mnist["target"]
X.shape
y.shape

There are 70,000 images, each with 784 features (28×28). Each feature is the intensity of one pixel, from 0 (white) to 255 (black).

Use imshow() to display one image.

import matplotlib
import matplotlib.pyplot as plt
some_digit = X[36000]
some_digit_image = some_digit.reshape(28, 28)
plt.imshow(some_digit_image, cmap = matplotlib.cm.binary,
interpolation="nearest")
plt.axis("off")
plt.show()

MNIST is already split into a training set (the first 60,000 images) and a test set (the last 10,000 images).

X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]

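Next, shuffle the training set. This ensures that all cross-validation folds are similar (no fold should be missing some digits). Some algorithms are also sensitive to the order of the training instances and perform worse when many similar instances arrive in a row.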

import numpy as np
shuffle_index = np.random.permutation(60000)
X_train, y_train = X_train[shuffle_index], y_train[shuffle_index]
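The reason a single permutation index is applied to both arrays can be seen on a toy example. This is only a sketch: the small arrays and the fixed seed are made up for illustration and are not part of the original notes.

```python
import numpy as np

# Toy data (hypothetical): row i of X is [i, i] and its label is i.
X = np.array([[i, i] for i in range(6)])
y = np.arange(6)

rng = np.random.RandomState(42)      # fixed seed so the sketch is repeatable
shuffle_index = rng.permutation(len(X))

# Indexing both arrays with the same permutation keeps each row with its label.
X_shuffled, y_shuffled = X[shuffle_index], y[shuffle_index]

pairs_still_match = bool(np.all(X_shuffled[:, 0] == y_shuffled))
print(pairs_still_match)
```

Shuffling X and y with two separate permutations would break the pairing between images and labels.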

Training a Binary Classifier

Simplify the problem to a binary classification task: 5 versus not-5.

y_train_5 = (y_train == 5) # True for all 5s, False for all other digits.
y_test_5 = (y_test == 5)

Create an SGDClassifier instance and train it.

from sklearn.linear_model import SGDClassifier
sgd_clf = SGDClassifier(random_state=42)
sgd_clf.fit(X_train, y_train_5)

Predict whether the image is a 5.

sgd_clf.predict([some_digit])

Performance Measures

Measuring Accuracy Using Cross-Validation

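Sometimes you need to implement cross-validation yourself. The code below does the same thing as cross_val_score() and gives the same results. StratifiedKFold performs stratified sampling, so every class is representatively present in each fold. n_splits=3 means 3 folds.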

from sklearn.model_selection import StratifiedKFold
from sklearn.base import clone
skfolds = StratifiedKFold(n_splits=3)  # random_state only matters with shuffle=True; newer scikit-learn raises an error if it is set without it
for train_index, test_index in skfolds.split(X_train, y_train_5):
    clone_clf = clone(sgd_clf)
    X_train_folds = X_train[train_index]
    y_train_folds = y_train_5[train_index]
    X_test_fold = X_train[test_index]
    y_test_fold = y_train_5[test_index]
    clone_clf.fit(X_train_folds, y_train_folds)
    y_pred = clone_clf.predict(X_test_fold)
    n_correct = sum(y_pred == y_test_fold)
    print(n_correct / len(y_pred)) # prints 0.9502, 0.96565 and 0.96495
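What "stratified" means can be checked on a small skewed target. The sketch below uses made-up labels (30 negatives, 6 positives) rather than MNIST, and simply counts how many positives land in each test fold.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy labels (hypothetical): 30 negatives and 6 positives, i.e. a skewed binary target.
y = np.array([False] * 30 + [True] * 6)
X = np.arange(len(y)).reshape(-1, 1)   # features are irrelevant for the split itself

skfolds = StratifiedKFold(n_splits=3)
positives_per_fold = []
for train_index, test_index in skfolds.split(X, y):
    # Count the positive labels that ended up in this test fold.
    positives_per_fold.append(int(y[test_index].sum()))

print(positives_per_fold)
```

Each test fold receives the same share of positives, which is what keeps the folds comparable.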

Alternatively, evaluate the SGDClassifier model with cross_val_score(). The training set is split into 3 folds; each fold is predicted by a model trained on the other 2 folds.

from sklearn.model_selection import cross_val_score
cross_val_score(sgd_clf, X_train, y_train_5, cv=3, scoring="accuracy")

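The accuracy is around 95%. But the code below shows that a classifier that labels every image as not-5 still reaches about 90% accuracy, simply because only about 10% of the images are 5s.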

from sklearn.base import BaseEstimator
class Never5Classifier(BaseEstimator):
    def fit(self, X, y=None):
        pass
    def predict(self, X):
        return np.zeros((len(X), 1), dtype=bool)
never_5_clf = Never5Classifier()
c=cross_val_score(never_5_clf, X_train, y_train_5, cv=3, scoring="accuracy")

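This shows that accuracy is generally not a good performance measure, especially on skewed datasets where some classes are much more frequent than others.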
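The arithmetic behind that 90% figure can be reproduced without any model. This is a sketch on made-up labels in which exactly one instance in ten is positive, roughly mimicking the share of 5s.

```python
# Toy labels (hypothetical): 1 positive in every 10 instances.
y_true = [i % 10 == 0 for i in range(1000)]

# A "classifier" that always answers "not 5".
y_pred = [False] * len(y_true)

n_correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = n_correct / len(y_true)
print(accuracy)  # 0.9
```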

Confusion Matrix

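The general idea is to count how many times instances of class A are classified as class B. For example, to see how often the classifier confused images of 5s with 3s, look at row 5, column 3 of the confusion matrix.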

from sklearn.model_selection import cross_val_predict
y_train_pred = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3)

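cross_val_predict() also performs K-fold cross-validation, but instead of returning evaluation scores it returns the predictions made on each test fold. You get a prediction for every instance in X_train, and each prediction is clean: the model that made it never saw that instance during training (one fold is set aside, the model is trained on the other two, then it predicts the held-out fold). Use confusion_matrix() to get the confusion matrix.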

from sklearn.metrics import confusion_matrix
ConfusionMatrix=confusion_matrix(y_train_5, y_train_pred)

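The first row covers the negative class (images that really are not 5s): 54402 were correctly classified as not-5 (true negatives) and 177 were wrongly classified as 5 (false positives). The second row covers the positive class (images that really are 5s): 2506 were wrongly classified as not-5 (false negatives) and 2915 were correctly classified as 5 (true positives).

A perfect classifier would look like the figure below: every entry off the main diagonal is 0.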

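Sometimes a more concise metric is preferable. The accuracy of the positive predictions is called the precision of the classifier:

\[\mathrm{precision} = \frac{TP}{TP + FP}\]

TP is the number of true positives and FP the number of false positives. In the confusion matrix above this is the ratio within the second column:

\[\frac{2915}{177 + 2915}\]

The other metric is recall, also called sensitivity or the true positive rate (TPR):

\[\mathrm{recall} = \frac{TP}{TP + FN}\]

With the numbers from the confusion matrix, recall is

\[\frac{2915}{2915 + 2506}\]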
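As a quick check, the two ratios can be evaluated directly from the four counts quoted for the confusion matrix. The variable names below are illustrative only.

```python
# Counts taken from the confusion matrix discussed above.
tn, fp = 54402, 177
fn, tp = 2506, 2915

precision = tp / (tp + fp)   # of everything predicted as 5, how much really is a 5
recall = tp / (tp + fn)      # of all real 5s, how many were found

print(round(precision, 3), round(recall, 3))
```

With these counts the classifier is rarely wrong when it says "5", but it misses close to half of the actual 5s.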

Precision and Recall

scikit-learn provides functions to compute precision and recall.

from sklearn.metrics import precision_score, recall_score
PrecisionScore=precision_score(y_train_5, y_train_pred) # == 4591 / (4591 + 1716)
RecallScore=recall_score(y_train_5, y_train_pred) # == 4591 / (4591 + 830)

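Using the counts in the code comments above (4591 true positives, 1716 false positives, 830 false negatives, which differ from the confusion matrix shown earlier and presumably come from a different run): when the classifier predicts 5 (whether correctly or by mistaking a non-5 for a 5), it is right about 72% of the time, and of all images whose true label is 5, about 84% are detected (recalled).

Precision and recall can be combined into a single metric, the F1 score, which is their harmonic mean. The harmonic mean gives much more weight to low values, so F1 is high only when both precision and recall are high.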
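Written out, the harmonic mean is F1 = 2 · precision · recall / (precision + recall), which is equivalent to TP / (TP + (FN + FP) / 2). A small sketch, using the confusion-matrix counts quoted earlier in these notes and a helper name (f1_from_counts) invented for illustration, compares it with the plain arithmetic mean:

```python
def f1_from_counts(tp, fp, fn):
    """F1 as the harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Counts from the confusion matrix earlier in these notes.
tp, fp, fn = 2915, 177, 2506
f1 = f1_from_counts(tp, fp, fn)

precision, recall = tp / (tp + fp), tp / (tp + fn)
arithmetic_mean = (precision + recall) / 2

# The harmonic mean is pulled toward the smaller of the two values.
print(round(f1, 3), round(arithmetic_mean, 3))
```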

Compute it with f1_score().

from sklearn.metrics import f1_score
F1Score=f1_score(y_train_5, y_train_pred)

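The F1 score favors classifiers whose precision and recall are similar, which is not always what you want. Sometimes low recall with high precision is preferable. For example, a classifier that decides whether a video is safe for children should have high precision even at the cost of low recall: it is acceptable to reject many safe videos (many false negatives, which lowers recall) as long as almost nothing unsafe is let through (few false positives, which keeps precision high). The opposite case is detecting shoplifters: 99% recall with only 30% precision is fine. Only 30% of the alerts are real shoplifters, so there will be false alarms, but 99% of shoplifters are caught and almost none slip through.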

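Precision/Recall Tradeoff

The classifier has a decision_function method. It computes a score for each instance; if the score is above a threshold the instance is classified as 5, otherwise as not-5.

In the example figure, moving the threshold to the right (raising it) brings precision up to 100%, but recall drops to 50%. Conversely, lowering the threshold increases recall and reduces precision.

Instead of calling predict(), you can call decision_function() to get the score and compare it with a threshold yourself: a score above the threshold means the positive class. With the threshold set to 0 the result is the same as predict().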

y_scores = sgd_clf.decision_function([some_digit])
threshold = 0
y_some_digit_pred = (y_scores > threshold)
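The claim that a threshold of 0 reproduces predict() can be verified on synthetic data. The sketch below uses make_classification as a stand-in for the 5 / not-5 task (an assumption made so the example runs without MNIST), and the stricter threshold of 5 is arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Synthetic binary data (hypothetical stand-in for the 5 / not-5 task).
X, y = make_classification(n_samples=200, n_features=10, random_state=42)
clf = SGDClassifier(random_state=42).fit(X, y == 1)

scores = clf.decision_function(X)

# Thresholding the scores at 0 gives exactly the labels returned by predict().
same_as_predict = bool(np.array_equal(scores > 0, clf.predict(X)))

# A higher threshold can only turn positive predictions into negative ones.
n_pos_default = int((scores > 0).sum())
n_pos_strict = int((scores > 5).sum())
print(same_as_predict, n_pos_default >= n_pos_strict)
```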

Use cross-validation to get a decision score for every instance in the training set.

y_scores = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3,
method="decision_function")

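Use precision_recall_curve() to compute precision and recall for every possible threshold.

from sklearn.metrics import precision_recall_curve
# For a binary task y_scores is 1-D. The original passed y_scores[:, 1], which only works if
# cross_val_predict returned a 2-D array (reportedly the case in scikit-learn 0.19.0).
precisions, recalls, thresholds = precision_recall_curve(y_train_5, y_scores)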

Finally, plot the curves.

def plot_precision_recall_vs_threshold(precisions, recalls, thresholds):
    plt.plot(thresholds, precisions[:-1], "b--", label="Precision")
    plt.plot(thresholds, recalls[:-1], "g-", label="Recall")
    plt.xlabel("Threshold")
    plt.legend(loc="center left")
    plt.ylim([0, 1])
plot_precision_recall_vs_threshold(precisions, recalls, thresholds)
plt.show()

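recalls[:-1] (and precisions[:-1]) drops the last element so that the array has the same length as thresholds.

Raising the threshold usually increases precision, but it can occasionally decrease it. Recall, on the other hand, never decreases when the threshold is lowered (and never increases when it is raised).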
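Both behaviours show up on a tiny hand-made ranking. The digits and scores below are invented for illustration (12 images ordered by score), and precision_recall_at is a hypothetical helper, not a scikit-learn function.

```python
# Toy example (made-up scores): 12 digits sorted by classifier score, 1 = lowest.
digits = [8, 7, 3, 9, 5, 2, 5, 5, 6, 5, 5, 5]
scores = list(range(1, 13))
is_5 = [d == 5 for d in digits]

def precision_recall_at(threshold):
    """Precision and recall when every score above the threshold is called a 5."""
    predicted = [s > threshold for s in scores]
    tp = sum(p and t for p, t in zip(predicted, is_5))
    fp = sum(p and not t for p, t in zip(predicted, is_5))
    fn = sum((not p) and t for p, t in zip(predicted, is_5))
    return tp / (tp + fp), tp / (tp + fn)

for threshold in (4, 5, 6, 7, 9):
    precision, recall = precision_recall_at(threshold)
    print(threshold, round(precision, 3), round(recall, 3))
```

Going from threshold 4 to 5 removes a true 5 from the positive predictions, so precision dips even though the threshold went up, while recall only ever falls.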

You can also plot precision directly against recall.

def plot_precision_vs_recall(precisions, recalls):
    plt.plot(recalls, precisions, "b--")
    plt.xlabel("recall")
    plt.ylabel("precision")
    plt.ylim([0, 1])
    plt.xlim([0, 1])
plot_precision_vs_recall(precisions, recalls)
plt.show()

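Precision starts to fall sharply at around 80% recall. You would probably pick a tradeoff before that drop, for example around 60% recall, so that precision does not suffer too much. If you want 90% precision, the plot shows that the threshold is around 70000, which gives the following code.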

y_train_pred_90 = (y_scores > 70000)

PrecisionScore=precision_score(y_train_5, y_train_pred_90)
RecallScore=recall_score(y_train_5, y_train_pred_90)

In this way you can reach whatever precision or recall you want by choosing the threshold.

The ROC Curve

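The receiver operating characteristic (ROC) curve plots the true positive rate (TPR, another name for recall) on the y-axis against the false positive rate (FPR) on the x-axis. FPR = 1 - specificity, where specificity is the true negative rate (TNR): the proportion of negative instances that are correctly classified as negative.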
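These rates can be computed for a single operating point from the confusion-matrix counts quoted earlier in these notes; a ROC curve is the set of such (FPR, TPR) points over all thresholds. A minimal sketch:

```python
# Counts from the confusion matrix earlier in these notes.
tn, fp = 54402, 177
fn, tp = 2506, 2915

tpr = tp / (tp + fn)          # true positive rate = recall
fpr = fp / (fp + tn)          # false positive rate
specificity = tn / (tn + fp)  # true negative rate (TNR)

print(round(tpr, 4), round(fpr, 4), round(specificity, 4))
```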

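Use roc_curve() to compute the FPR and TPR for various thresholds.

from sklearn.metrics import roc_curve
# y_scores is 1-D for a binary task, so it is passed as-is (the original indexed y_scores[:, 1]).
fpr, tpr, thresholds = roc_curve(y_train_5, y_scores)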

Then plot it.

def plot_roc_curve(fpr, tpr, label=None):
    plt.plot(fpr, tpr, linewidth=2, label=label)
    plt.plot([0, 1], [0, 1], 'k--')
    plt.axis([0, 1, 0, 1])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
plot_roc_curve(fpr, tpr)
plt.show()

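There is a tradeoff here too: the higher the TPR, the higher the FPR. The dashed diagonal represents a purely random classifier; a good classifier stays as far from that line as possible (toward the top-left corner).
One way to compare classifiers is the area under the curve (AUC). A perfect classifier has a ROC AUC of 1, and a purely random classifier has a ROC AUC of 0.5. Scikit-Learn provides a function for it.

from sklearn.metrics import roc_auc_score
# y_scores is 1-D for a binary task, so it is passed as-is (the original indexed y_scores[:, 1]).
rocAucScore=roc_auc_score(y_train_5, y_scores)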
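The two reference values (1 for a perfect ranking, 0.5 for an uninformative one) can be reproduced on four hand-made instances. The labels and scores here are invented purely for illustration.

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]

perfect = roc_auc_score(y_true, [0.1, 0.2, 0.8, 0.9])        # every positive outranks every negative
uninformative = roc_auc_score(y_true, [0.5, 0.5, 0.5, 0.5])  # identical score for everything
inverted = roc_auc_score(y_true, [0.9, 0.8, 0.2, 0.1])       # ranking is exactly backwards

print(perfect, uninformative, inverted)
```

An AUC well below 0.5 means the scores carry information but with the wrong sign.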

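Comparing the precision/recall (PR) curve with the ROC curve: prefer the PR curve when the positive class is rare or when you care more about false positives than false negatives.

For example, judging from the plot above, rocAucScore looks high, but that is mostly because the training set has few positives (5s) compared with negatives (non-5s); there is still room for improvement.

Now use a random forest instead of SGD.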

from sklearn.ensemble import RandomForestClassifier
forest_clf = RandomForestClassifier(random_state=42)
y_probas_forest = cross_val_predict(forest_clf, X_train, y_train_5, cv=3,
                                    method="predict_proba")

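Unlike the SGD scores, y_probas_forest contains the probability of belonging to each class.

Each row is one instance. For example, instance 2 has a 0.7 probability of being in the positive class (1). Use the positive-class probability as the score for plotting.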

y_scores_forest = y_probas_forest[:, 1] # score = proba of positive class
fpr_forest, tpr_forest, thresholds_forest = roc_curve(y_train_5,y_scores_forest)
plt.plot(fpr, tpr, "b:", label="SGD")
plot_roc_curve(fpr_forest, tpr_forest, "Random Forest")
plt.legend(loc="lower right")
plt.show()

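Multiclass Classification

Random forest classifiers and naive Bayes classifiers can handle multiple classes directly. Support vector machines and linear classifiers are strictly binary classifiers, but several binary classifiers can be combined to perform multiclass classification.

one-versus-all (OvA) strategy (also called one-versus-the-rest): to classify digits 0~9, train 10 binary classifiers, one per digit. To classify an image, run it through all 10 and pick the class whose classifier outputs the highest score.

one-versus-one (OvO): train a binary classifier for every pair of digits: 0 vs 1, 0 vs 2, 1 vs 2, ..., 8 vs 9. With N classes, this requires N × (N - 1) / 2 classifiers.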
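The N × (N - 1) / 2 count is simply the number of unordered pairs of classes, which can be enumerated with the standard library as a sanity check:

```python
from itertools import combinations

classes = list(range(10))                 # the digits 0..9
pairs = list(combinations(classes, 2))    # one binary classifier per unordered pair

n = len(classes)
print(len(pairs), n * (n - 1) // 2)  # 45 45
```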

When SGDClassifier is trained on a multiclass target, it uses OvA internally.

sgd_clf.fit(X_train, y_train) # y_train, not y_train_5
sgd_clf.predict([some_digit])

Look at the score for each class; the class with the highest score is indeed 5.

some_digit_scores = sgd_clf.decision_function([some_digit])

print(np.argmax(some_digit_scores))
print(sgd_clf.classes_)
print(sgd_clf.classes_[5])
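The step from scores to a label deserves a small illustration, because np.argmax returns a position in the score vector, not a class. The sketch below uses invented weights and string class names (all hypothetical) so that the difference between an index and a label is visible; on MNIST the two happen to coincide because the classes are 0 to 9 in order.

```python
import numpy as np

# Hypothetical weights for 3 classes and 4 features; a trained model would learn these.
classes = np.array(["cat", "dog", "owl"])
W = np.array([[ 1.0, 0.0, -1.0, 0.5],
              [ 0.0, 2.0,  0.0, 0.0],
              [-1.0, 0.0,  1.0, 0.0]])
b = np.array([0.0, -1.0, 0.5])

x = np.array([1.0, 2.0, 0.0, 2.0])

scores = W @ x + b                      # one score per class, like decision_function
predicted = classes[np.argmax(scores)]  # map the winning index back to a class label
print(scores, predicted)
```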

With the OvO strategy, 45 binary classifiers are trained in total.

from sklearn.multiclass import OneVsOneClassifier
ovo_clf = OneVsOneClassifier(SGDClassifier(random_state=42))
ovo_clf.fit(X_train, y_train)
print(ovo_clf.predict([some_digit]))
print(len(ovo_clf.estimators_))

A random forest can predict multiple classes directly.

forest_clf.fit(X_train, y_train)
forest_clf.predict([some_digit])
forest_clf.predict_proba([some_digit])

It directly identifies the digit as a 5.

Evaluate the SGD classifier with cross-validation.

print(cross_val_score(sgd_clf, X_train, y_train, cv=3, scoring="accuracy"))

[ 0.85067986  0.79068953  0.85227784]

Standardizing the training data improves the accuracy.

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train.astype(np.float64))
print(cross_val_score(sgd_clf, X_train_scaled, y_train, cv=3, scoring="accuracy"))

[ 0.91041792  0.91169558  0.90528579]
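What StandardScaler does to each column can be written by hand: subtract the column mean and divide by the column standard deviation. The sketch below compares the manual version with StandardScaler on a tiny made-up matrix; the constant last column is included on purpose, since image data typically contains pixels that never change.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy "pixel" matrix (hypothetical); the last column is constant.
X = np.array([[0.0, 10.0, 0.0],
              [128.0, 20.0, 0.0],
              [255.0, 60.0, 0.0]])

mean = X.mean(axis=0)
std = X.std(axis=0)          # population standard deviation (ddof=0)
std[std == 0] = 1.0          # avoid dividing by zero for constant columns
X_manual = (X - mean) / std

X_scaled = StandardScaler().fit_transform(X)
print(np.allclose(X_manual, X_scaled))
```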

Error Analysis

Analyze the errors as a whole using the confusion matrix.

y_train_pred = cross_val_predict(sgd_clf, X_train_scaled, y_train, cv=3)
conf_mx = confusion_matrix(y_train, y_train_pred)

Display the confusion matrix as an image.

plt.matshow(conf_mx, cmap=plt.cm.gray)
plt.show()

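Most images are on the main diagonal, which means they were classified correctly. The 5s look darker than the other digits: either there are fewer 5s in the dataset or the classifier performs worse on 5s. In this case both are true. White means a large count; the darker the cell, the smaller the count.

Each row sum is the number of images in that class. Divide each row by its sum to get rates instead of absolute counts.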

row_sums = conf_mx.sum(axis=1, keepdims=True)
norm_conf_mx = conf_mx / row_sums

We only care about the errors now, so fill the diagonal with zeros (black).

np.fill_diagonal(norm_conf_mx, 0)
plt.matshow(norm_conf_mx, cmap=plt.cm.gray)
plt.show()
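The normalization and the zeroed diagonal are easier to follow on a small matrix. The 3-class confusion matrix below is made up for illustration; the operations are the same ones applied to conf_mx above.

```python
import numpy as np

# Hypothetical 3-class confusion matrix (rows = actual class, columns = predicted class).
conf_mx = np.array([[50,  2,  3],
                    [ 4, 20,  1],
                    [10,  0, 90]])

row_sums = conf_mx.sum(axis=1, keepdims=True)  # shape (3, 1), broadcasts across columns
norm_conf_mx = conf_mx / row_sums              # each row now sums to 1

errors_only = norm_conf_mx.copy()
np.fill_diagonal(errors_only, 0)               # keep only the misclassification rates

print(errors_only.round(3))
```

Without keepdims=True the row sums would have shape (3,) and the division would be applied column by column instead of row by row.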

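Rows represent actual classes and columns represent predicted classes. Columns 8 and 9 contain many bright cells, which means many images get misclassified as 8 or 9. Row 8 also has many bright cells, which means 8s are often misclassified as other digits. Row 0 is mostly black, which means 0s are classified correctly. The errors are not symmetrical: for example, many 5s are misclassified as 8s, but the reverse is not necessarily true.

This kind of plot gives you intuition about where the classifier needs improvement. For example, you could collect more images of 8s and 9s (which are often involved in errors), or hand-write an algorithm that counts closed loops (8 has two loops, 6 has one, 5 has none), or use tools such as Scikit-Image, Pillow, or OpenCV to make patterns like those loops stand out.

Looking at the images the classifier gets wrong helps explain why it performs poorly. Here we look at 3s misclassified as 5s and 5s misclassified as 3s.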

def plot_digits(instances, images_per_row=10, **options):
    size = 28
    images_per_row = min(len(instances), images_per_row)
    images = [instance.reshape(size,size) for instance in instances]
    n_rows = (len(instances) - 1) // images_per_row + 1  # ceiling division, e.g. 25 images at 5 per row -> 5 rows
    row_images = []
    n_empty = n_rows * images_per_row - len(instances)
    images.append(np.zeros((size, size * n_empty)))  # pad the last row with one blank block
    for row in range(n_rows):
        rimages = images[row * images_per_row : (row + 1) * images_per_row]  # the images for this row
        row_images.append(np.concatenate(rimages, axis=1))  # one horizontal strip (28 x 280 for 10 images)
    image = np.concatenate(row_images, axis=0)
    plt.imshow(image, cmap = matplotlib.cm.binary, **options)
    plt.axis("off")
cl_a, cl_b = 3, 5
X_aa = X_train[(y_train == cl_a) & (y_train_pred == cl_a)]
X_ab = X_train[(y_train == cl_a) & (y_train_pred == cl_b)]
X_ba = X_train[(y_train == cl_b) & (y_train_pred == cl_a)]
X_bb = X_train[(y_train == cl_b) & (y_train_pred == cl_b)]
plt.figure(figsize=(8,8))
plt.subplot(221); plot_digits(X_aa[:25], images_per_row=5)
plt.subplot(222); plot_digits(X_ab[:25], images_per_row=5)
plt.subplot(223); plot_digits(X_ba[:25], images_per_row=5)
plt.subplot(224); plot_digits(X_bb[:25], images_per_row=5)
plt.show()

To explain the function, take X_ab as an example:
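The grid assembly inside plot_digits can be traced with tiny arrays. The sketch below repeats the same arithmetic without plotting, using 7 made-up "images" of 2×2 pixels laid out 5 per row (sizes chosen only to keep the shapes small).

```python
import numpy as np

size = 2                      # tiny 2x2 "images" instead of 28x28, for readability
instances = [np.full(size * size, k) for k in range(7)]   # 7 flat images
images_per_row = 5

images = [inst.reshape(size, size) for inst in instances]
n_rows = (len(instances) - 1) // images_per_row + 1        # ceil(7 / 5) = 2
n_empty = n_rows * images_per_row - len(instances)         # 3 missing slots
images.append(np.zeros((size, size * n_empty)))            # one wide blank block

row_images = []
for row in range(n_rows):
    rimages = images[row * images_per_row:(row + 1) * images_per_row]
    row_images.append(np.concatenate(rimages, axis=1))     # each row is (2, 10)
grid = np.concatenate(row_images, axis=0)
print(grid.shape)  # (4, 10)
```

The second row holds the last two images followed by the blank block, which is what makes every row the same width so the rows can be stacked.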

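The code above uses only SGDClassifier, which is just a linear model. It assigns a weight per class to each pixel, and when it sees a new image it sums the weighted pixel intensities to get a score for each class, then picks the class with the highest score. To help the model tell 3s from 5s, preprocess the images so that the digits are centered and not too rotated.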

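Multilabel Classification

Consider recognizing three people in a picture: Alice, Bob, and Charlie. If the picture contains only Alice and Charlie but not Bob, the output is [1, 0, 1].

The code below first builds the labels: the first label indicates whether the digit is large (greater than or equal to 7) and the second whether it is odd. It then creates a KNeighborsClassifier() instance (not all classifiers support multilabel classification).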

from sklearn.neighbors import KNeighborsClassifier  # import missing from the original snippet
y_train_large = (y_train >= 7)
y_train_odd = (y_train % 2 == 1)
y_multilabel = np.c_[y_train_large, y_train_odd]
knn_clf = KNeighborsClassifier()
knn_clf.fit(X_train, y_multilabel)

Then make a prediction for the digit 5.

print(knn_clf.predict([some_digit]))

Evaluate this classifier with the F1 score.

y_train_knn_pred = cross_val_predict(knn_clf, X_train, y_multilabel, cv=3)
F1ScoreMul=f1_score(y_multilabel, y_train_knn_pred, average="macro")

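This assumes that every label in y_multilabel is equally important. If, for instance, the labeled training set has many more pictures of Alice than of Bob or Charlie, you may want to give each label a weight equal to its support (the number of instances with that target label). To do this, set average="weighted" in the code above.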
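The difference between the two averaging modes is just an unweighted versus a support-weighted mean of the per-label F1 scores. A sketch with invented numbers (two labels, made-up F1 values and supports):

```python
# Hypothetical per-label F1 scores and supports (number of positive instances per label).
f1_per_label = [0.9, 0.6]
support = [10, 30]

macro_f1 = sum(f1_per_label) / len(f1_per_label)
weighted_f1 = sum(f * s for f, s in zip(f1_per_label, support)) / sum(support)

print(macro_f1, weighted_f1)
```

Because the weaker label has three times the support here, the weighted average ends up below the macro average.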

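Multioutput Classification

The input is a noisy digit image and the output is a clean digit image. Each pixel is one label, and each label can take multiple values (0~255).

First add noise to the digit images.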

noise = np.random.randint(0, 100, (len(X_train), 784))
X_train_mod = X_train + noise
noise = np.random.randint(0, 100, (len(X_test), 784))
X_test_mod = X_test + noise
y_train_mod = X_train
y_test_mod = X_test

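On the left is the noisy image; on the right is the clean image.

Then train the classifier and use it to clean an image.

some_index = 0  # assumption: the original never defines some_index; any test-set index works
knn_clf.fit(X_train_mod, y_train_mod)
clean_digit = knn_clf.predict([X_test_mod[some_index]])
# plot_digit() is not defined in this post; display the result with imshow() as earlier
plt.imshow(clean_digit.reshape(28, 28), cmap=matplotlib.cm.binary, interpolation="nearest")
plt.show()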
