模式识别在生物信息领域的应用实例

模式识别在生物信息领域的应用实例一、问题需求细胞是动植物的结构和功能的基本单位。人体大约由几十万亿个细胞组成，根据细胞的形态、功能等可以把细胞分为众多不同的类型，如生殖细胞、神经细胞等等。研究发现，即使是同一类的细胞，在形态、基因表达等方面仍存在着差异。为了探究细胞与细胞之间的差异及其原因，单细胞测序技术应运而生。本次作业将使用单细胞基因表达数据对细胞分类问题进行探究。...

欧恩方

4700人浏览 · 2018-07-06 20:46:12

欧恩方 · 2018-07-06 20:46:12 发布

模式识别在生物信息领域的应用实例

一、问题需求

细胞是动植物的结构和功能的基本单位。人体大约由几十万亿个细胞组成，根据细胞的形态、功能等可以把细胞分为众多不同的类型，如生殖细胞、神经细胞等等。研究发现，即使是同一类的细胞，在形态、基因表达等方面仍存在着差异。为了探究细胞与细胞之间的差异及其原因，单细胞测序技术应运而生。本次作业将使用单细胞基因表达数据对细胞分类问题进行探究。

在给出的单细胞基因表达数据中，每一行为一个细胞，每一列为一个基因的表达值。我们可以把每一个细胞看做一个样本，把它对应的基因表达量作为特征向量。

现需基于上述的单细胞基因表达数据，研究如下问题：

对提供的两类细胞的单细胞RNA-seq数据设计分类器，并针对测试数据给出类别标签预测。
对提供的十类细胞的单细胞RNA-seq数据设计分类器，并针对测试数据给出类别标签预测。
对2中的十类细胞数据使用PCA或tSNE等方法进行降维和可视化，并对数据进行聚类分析，比较聚类结果和真实标签之间的关系。

未征得老师允许，这里就不上传原始数据啦，胖友们可以自己找数据集玩一下。

二、问题分析

通过读取查看给出的数据集，可以发现，无论是二分类还是十分类问题的数据量都比较少，平均每类一千多条样本，而特征维度有两万五千多，如果直接用来训练，必然会出现严重的过拟合现象；同时，在这些数据样本中，可能存在大量的对分类不起作用或相关性很高的“冗余特征”，存在噪声的特征维度，因此需要先对数据进行清洗、筛选、降维，再对降维后的数据进行训练。总体的步骤过程如下图。

这里写图片描述

首先，在读取数据之后，我们可以先删除分布完全相同的特征，可以发现这样的特征在二分类数据集中有七千多种，而十分类数据集也有一千多种。其次删除对分类无贡献或贡献很小的特征，考虑到噪声导致分布上的波动，可将方差小于0.1的特征视作所有类差别不大的特征，即对分类贡献很小的特征。此时，特征空间的维度已减小过一半。我们可以再对得到的数据进行降维，可采用主成分分析PCA等方法。

由于数据集的数据量比较少，因此可以考虑使用支持向量机SVM对二分类数据进行分类训练，当然也可以也可以考虑使用神经网络进行训练，对于十分类数据可以考虑使用多个单模型进行投票，也可以考虑支持多分类的支持向量机。

对于降维可视化，可以考虑PCA+tSNE的策略，保证降维效果和运算速率。

三、编程环境

编程语言：Anaconda3 + Tensorflow1.8
操作系统：Win10

四、二分类问题

预处理

剔除分布相同的特征

我们使用pandas读取数据集，然后检测分布相同的特征并将其视为冗余特征删除掉。

import pandas as pd
xtest = xtest.T[~x.T.duplicated()].T
x = x.T.drop_duplicates().T

将方差小于0.1的特征视为对分类帮助不大的特征，也进行删除。

var = x.var(0)
cols = np.append(cols, np.argwhere(var < 0.1))
index = list(set(i for i in range(x.shape[1])).difference(
    cols.reshape(-1)
))
x, xtest = x[:, index], xtest[:, index]

剔除含噪量较大的特征

如果一列特征中出现少于5个非零值，将其视为噪声特征，删除之。

x_not_zero = x > 0
sum_not_zero = sum(x_not_zero)
noise_cols = np.argwhere(sum_not_zero < 5)
index = list(set(i for i in range(x.shape[1])).difference(
    noise_cols.reshape(-1)
))

异常值检测

用箱型图的办法检测异常值，即计算出25%和75%的点，记为下四分位数Q~L~和上四分位数Q~U~，可计算出四分位数间距 $IQR=Q_U-Q_L$ ，则上限为 $Q_U+1.5IQR$ ，下限为 $Q_L-1.5IQR$ ，超过界限的点即为异常点。在此处采用将异常值修改为对应的四分位数。

# 异常值处理
# ql, qu为下、上四分位数, iqr为四分位数间距
x_desc = x.describe()
ql, qu = x_desc.iloc[4, :], x_desc.iloc[6, :]
iqr = qu - ql
# 设置上下限
upper = qu + 1.5 * iqr
lower = ql - 1.5 * iqr
# 异常值置于四分位点
tmp1, tmp2 = x - upper, xtest - upper
x[tmp1 > 0] = x[tmp1 > 0] - tmp1[tmp1 > 0]
xtest[tmp2 > 0] = xtest[tmp2 > 0] - tmp2[tmp2 > 0]
tmp1, tm2 = lower - x, lower - xtest
x[tmp1 > 0] = x[tmp1 > 0] + tmp1[tmp1 > 0]
xtest[tmp2 > 0] = xtest[tmp2 > 0] + tmp2[tmp2 > 0]
x, xtest = x.astype(int), xtest.astype(int)

处理完异常值后再进行一次冗余检测，输出的特征仅剩553维！对处理完的数据进行分类，精度在96%~98%之间，波动较大。因此这样对异常值的处理效果不太好，因此在最终的方案中并未采用。同时在处理过程中发现异常值有两万多个，说明很可能存在一两个异常样本。

特征提取

尽管在最终的模型中未采用特征提取的方法，但是在实验中尝试了对特征选择后的数据进行特征提取，提取后的数据在SVM中精度达97%，不如未提取的数据。

我们采用PCA方法进行降维。

y = np.array(y)
x = np.array(x)
if n_components:
    pca = PCA(n_components, whiten=True)
    pca.fit(x)
    x = pca.transform(x)
    x_test = pca.transform(x_test)

通过测试表明，如果保留95%以上的信息，将需要输出624维的特征，对于一般的分类器和简单感知器网络而言，这样的维度是难以接受的。可以考虑，能不能适当的丢失更多信息，而在保持较好的分类效果的同时，避免过拟合现象的出现呢？我们可以先粗略的从5~100以5为间隔选择输出维度，重复5次实验，取平均值，得到如下图结果：

这里写图片描述
可见在40\~80之间能达到较好的分类效果。因此同理，可分别在40\~80和50~70中分别以2、1为间隔进行测试，重复5次实验，取平均值，得到下两幅图的结果：

这里写图片描述

因此，我们选用51个维度为输出特征维度，这个维度约为样本数的2%，因此是可被接受的。使用这51维特征训练SVM，精度在96%~98%之间，显然不如直接使用全部的选择的特征；用来训练两个隐层的神经网络，精度可达97%，比未进行特征提取的精度高，后者的精度仅有95%。

数据集划分

方案一：把提供训练的数据划分为训练集（80%）、验证集（10%）、测试集（10%）；
方案二：把提供训练的数据划分为训练集（90%）、测试集（10%）；
方案三：不拆分训练数据，全部作为训练集。

其中，方案一用于除SVM之外的模型如神经网络等的参数和模型选择；方案二用于支持向量机的训练，此处采用scikit-learn的svm模块，因此不需要自己制作验证集。方案三用于最终训练支持向量机并对测试数据进行预测。

模型构造

由于选用SVM作为分类模型，SVM的核方法的思想是将低维数据映射到高维空间进行线性分类，因此事实上并不需要对数据进行降维。但考虑到计算量和噪声，仍可以将数据进行上述的筛选、剔除，得到两千多维的特征空间，在此空间中，计算量是客观的，而数据信息保留较完整。直接使用筛选后的数据进行训练即可。

在同一批的训练集和测试集划分中，线性支持向量机的准确率可达98%以上，而已调好参数的非线性支持向量机只能达97%，因此使用线性支持向量机的效果更好。

如果考虑先对筛选后的数据进行PCA降维，再使用线性SVM进行训练，也能达98%的精度，但上下起伏波动较大，因为两类数据数量分布不均匀，降维后的线性SVM最差精度在96%以下，不如未降维的数据，因此直接使用筛选后、未降维的数据。

综上，选择的模型为线性SVM，数据使用筛选清洗后、不进行PCA降维的两千多特征的数据。经实验验证，在该数据集上SVM取C=1已达最优，即采用sklearn的默认参数即可。

为了进一步提高稳定性，此处采用KFold的办法，使用十倍交叉验证构造十个线性SVM，再对测试数据进行投票，对于投票一致的样本，认为是分类准确的，将其并入训练集中，对于投票不一致的，视为存疑数据，重新利用新的整体的训练集训练，得到一个新的SVM，再利用该SVM对存疑数据进行分类。经实验验证，采用本方案得到的分类系统的稳定性比单纯使用十个SVM投票要好，平均精度也有提高。

代码如下，其中对label规范到{1，-1}，事实上，sklearn里的svm并不要求规范化：

def train(x, y, show_confusion=False):
    # 拆分训练集和测试集
    x_train, y_train, x_test, y_test = records.divide_sets(x, y, need_validate=False)
    y_train[y_train == 2], y_test[y_test == 2] = -1, -1
    y_test, y_train = y_test.reshape(-1), y_train.reshape(-1)
    # 十倍交叉验证
    kf = KFold(n_splits=10)
    preds = []
    for train_index, validate_index in kf.split(y_train):
        clf = svm.LinearSVC()
        clf.fit(x_train[train_index], y_train[train_index])
        pred = clf.predict(x_test)
        preds.append(pred)
    pred = np.mean(preds, 0)
    classified = np.array(list(x_test[pred == 1]) + list(x_test[pred == -1]))
    x_train_new = np.array(list(x_train) + list(classified))
    y_train_new = np.append(np.append(y_train, pred[pred == 1]), pred[pred == -1])

    clf = svm.LinearSVC()
    clf.fit(x_train_new, y_train_new.reshape((-1, 1)))

    for i in range(len(pred)):
        if pred[i] != 1. and pred[i] != -1.:
            pred[i] = clf.predict(x_test[i].reshape((1, -1)))[0]
    accuracy = accuracy_score(y_test, pred)
    print("Acc of all-SVC-vote in test sets is {:.4f}%".format(accuracy * 100))
    if show_confusion:
        print(confusion_matrix(y_test, pred))

实验结果

使用十倍交叉验证，可以发现精度在98%~99%之间，下述为某次测试结果：

1	2	3	4	5	6	7	8	9	10	aver
acc	0.9870	0.9826	0.9891	0.9848	0.9870	0.9891	0.9870	0.9891	0.9848	0.9891

可见线性SVM对本分类问题的精度和稳定性都还可以。在利用十倍交叉验证采用投票和构造新模型的机制后，下述为某次实验，对模型训练十次的精度，可见精度和稳定性都得到了提升。

1	2	3	4	5	6	7	8	9	10	aver
acc	0.9891	0.9913	0.9870	0.9891	0.9935	0.9935	0.9935	0.9913	0.9848	0.9957

下述为某次测试的混淆矩阵，其精度为98.9130%，可以发现在测试集中两类数据数量相差较大，其中一类约为另一类的1.5倍，这不是偶然的，事实上在给出的数据中本身就存在这样的关系，导致SVM会随着测试集抽样的不同引发不同的精度波动，从而影响分类结果的稳定性。而在本方法模型能较好地规避这个问题。

273	3
2	182

其他方法

非线性SVM

使用二次非线性SVM分类器，经过调参，发现取核函数 $K=(0.019x+22)^2$ ，取 $C=1$ 效果较好，精度在95%~97%之间，显然不如线性SVM。

使用三次非线性SVM分类器，取核函数 $K=(0.020x+22)^3$ ，取 $C=0.5$ 效果较好，精度在95%~97%之间，比二次更差一点。

使用径向基SVM分类器，取核函数系数 $\gamma=0.0075$ ，取 $C=12$ 效果较好，精度在96%~98%之间，比二次稍好。

神经网络

如果直接对高维的数据构造神经网络模型，显然由于数据量太少而特征维度太高极易出现过拟合，尽管可能在给定的数据集表现很好，但可迁移性不高，不是一个好的模型。因此使用PCA降维后的数据进行训练，训练模型为含两个隐层的全连接神经网络。在训练中可以观察到，神经网络模型在训练五六千次（每次抽取64对数据进行训练）后出现过拟合现象，表现为：随着训练次数增加，损失函数不断减少（如果设置的学习率较大则会出现loss上升的情况），但验证集表现精度不变，甚至还有下降趋势。因此可将训练次数设置为5000，一方面保证训练精度，另一方面及时防止过拟合。最终测出精度为96%~97%。

def fc_net(x_train, y_train, x_test, in_dim=100):
    x = tf.placeholder(tf.float32, [None, in_dim], name='x-input')
    y_ = tf.placeholder(tf.float32, [None, 2], name='y-input')
    with slim.arg_scope(
            [slim.fully_connected],
            weights_regularizer=slim.l2_regularizer(1e-4),
            weights_initializer=tf.truncated_normal_initializer(stddev=0.1),
            biases_initializer=tf.constant_initializer(0.1),
            activation_fn=tf.sigmoid
    ):
        out = slim.fully_connected(x, 200)
        y = slim.fully_connected(out, 2)
    loss = slim.losses.softmax_cross_entropy(y, y_)
    accuracy = tf.reduce_mean(
        tf.cast(tf.equal(tf.argmax(y, 1), tf.argmax(y_, 1)), tf.float32)
    )
    global_step = tf.Variable(0, trainable=False)
    learning_rate = tf.train.exponential_decay(0.1, global_step, 128, 0.99)

    optimizer = tf.train.GradientDescentOptimizer(learning_rate)
    train_step = optimizer.minimize(loss, global_step=global_step)
    with tf.Session() as sess:
        tf.global_variables_initializer().run()
        x_t, y_t, x_v, y_v = records.divide_sets(x_train, y_train, need_validate=False)
        for i in range(1, 5001):
            # x_batch, y_batch = next_batch(x_train, y_train, 128)
            x_batch, y_batch = next_batch(x_t, y_t, 128)
            train_dict = {x: x_batch, y_: y_batch}
            validate_dict = {x: x_v, y_: y_v}
            sess.run(train_step, feed_dict=train_dict)
            if i % 100 == 0:
                los, acc, steps = sess.run(
                    [loss, accuracy, global_step], feed_dict=validate_dict
                )
                print('step:{}, loss:{}, acc:{}'.format(steps, los, acc))
        test_dict = {x: x_test, y_: np.zeros((x_test.shape[0], 2))}
        return sess.run(tf.argmax(y, 1) * 2 - 1, feed_dict=test_dict)

五、十分类问题

根据二分类的经验，由于数据量少而特征维度高，此处采用多分类的支持向量机最优。我们采用与二分类相同的特征筛选方法，即去除冗余特征、去除方差小于0.1的无贡献特征和少于5个非零值的噪声特征。可筛选剩三千多项特征。这大大减小了训练SVM的计算量。

我们尝试使用线性SVM，某次十倍交叉验证（并不是全部数据参与十倍交叉验证，已有一部分数据被抽出作为测试集）结果如下：

1	2	3	4	5	6	7	8	9	10	aver
acc	0.9977	0.9984	0.9984	0.9930	0.9961	0.9946	0.9992	0.9977	1	0.9961

可见精度在99.30%~100%之间波动，平均精度达99.71%，分类效果已经很好，不需再考虑非线性SVM。

同样采用与二分类一样的投票方法，重复10次实验，结果如下：

1	2	3	4	5	6	7	8	9	10	aver
acc	0.9981	0.9972	0.9978	0.9978	0.9984	0.9972	0.9963	0.9966	0.9960	0.9972

精度在99.60%~99.84%之间波动，平均精度达99.73%，精度和稳定性都有所提高。

某次分类结果的混淆矩阵如下：

245	0	0	0	0	0	0	0	0	0
0	282	0	0	0	0	0	0	0	0
0	0	258	0	0	0	0	0	0	0
0	1	0	283	0	0	0	2	0	0
0	0	0	0	282	0	0	0	0	0
0	0	0	0	0	264	0	0	0	0
0	1	0	0	0	0	543	0	1	1
2	0	0	0	0	0	0	483	0	0
0	0	0	0	0	0	2	0	294	0
0	0	0	0	0	0	2	0	0	270

可见给定的数据集同样存在数据数量不均匀的问题，该问题容易造成分类精度的降低，引起精度波动。采用投票的办法可提高鲁棒性。

六、聚类

为了降低计算量，我们可以考虑对筛选剩的三千多项特征进行降维。由于tSNE的降维效果最好，但运行很慢，因此可以考虑采用PCA+tSNE的策略，即用PCA降维至小维度空间，再用tSNE进行降维可视化，这样在保证运行速率的同时使得降维效果有所提高。

先用PCA将特征空间降到50维，再用tSNE降维至三维，得到如下散点图：

这里写图片描述

可见十类样本分层明显，但存在个别异常点，原因可能有：特征选择时丢失信息；PCA降维仅保存了特征选择后的95%的信息，进一步丢失数据；样本本身存在异常值。

如果采用PCA直接降到3维空间，得下图：
这里写图片描述
显然十类样本交叠较明显，可分性不够高。因此在降维效果上PCA不如tSNE。但值得注意的是，尽管已实现使用PCA将特征降至50维，此时再采用tSNE降维所需要的时间仍然很多，在本机测试数需要20分钟才能得出降维结果；而如果先用PCA降维至10维，再用tSNE进行降维，也需要十分钟。综上，采用PCA+tSNE的策略可以兼顾两者的优点，提高降维的效果。

从tSNE的降维可视化可见该算法降维结果具有明显的分层，因此可以考虑直接使用tSNE降维后的三维数据进行聚类。设置10个聚类中心，采用KMeans算法进行聚类，将聚类结果每一簇的点的真实类别放入到一个集合中，统计每一个集合的众数，将集合中与众数相等的点视为分类准确点，可计算得到聚类的准确性为80.05%。但是，我们观察每个集合的众数，分别为：1、8、9、10、4、5、7、7、6、3，可以发现，集合的众数存在重复，即同一类可能存在相差较远的两个聚类中心，因此使用与分类数相同的聚类数进行聚类效果并不一定是最好，但是聚类数的确定仍需靠经验或者实验测得。

七、总结

对于原始的采样数据，可能存在大量冗余信息和噪声信息，在进行分类之前需要对数据进行预处理，一般来说，去掉方差较小的特征对分类效果影响不大，反而可以提高运算速度。对于异常值的处理需要谨慎，常用的异常值处理方法大多依靠经验，可能会对分类产生较大影响。

对于数据量少、特征维度高的数据集，采用支持向量机的模型进行分类可得到较好效果；在使用支持向量机时，如果线性SVM能满足要求，就不需要考虑多项式核SVM，如果采用多项式核SVM能满足要求，就不需要考虑高斯核SVM。为了提高算法鲁棒性，可考虑使用多个模型组建投票机进行投票。对于此类数据集，如果采用神经网络模型，很容易出现过拟合现象，此时可考虑数据降维、及时停止训练、降低学习率等办法，尽量避免过拟合的产生。

此数据集中数据分布不均匀，不仅会对数据抽样划分的结果产生影响，还会影响模型的建立，导致模型准确性产生波动，影响稳定性。这时候可以引入存疑数据的概念，将投票不一致的数据记为存疑数据，采用其他模型进行评估，一般来说可以提高稳定性，但这其中的关键是这里的其他模型的设计存在难度。因此，在数据采用的时候应尽量保证采样数量的均匀。

对于降维算法而言，PCA的运算量较少，运算起来较快，而tSNE运算巨慢，需要大量的时间，但降维效果明显比PCA好，因此可以综合考虑两种算法，可以先使用PCA降维到较低空间，再使用tSNE进行降维可视化。

PS：存疑数据的处理是我自己想出来的，没有经过数学证明收敛性，欢迎胖友们指正啦~

附：完整代码

records.py

# -*- coding:utf-8 -*-
import pandas as pd
import random
import tensorflow as tf
import numpy as np
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
from sklearn import svm
from sklearn.metrics import accuracy_score


def create_record(cwd, num_classes):
    """
    :param cwd: 当前工作路径
    :param num_classes: 数据集的类别数，用于选择数据集
    """
    # 读取文本
    path = cwd + "/" + str(num_classes) + "classes/" + str(num_classes)
    x = pd.read_csv(path + 'ctrainX.txt', sep='\t', header=None,
                    dtype=int, na_filter=False).values
    y = pd.read_csv(path + 'ctrainY.txt', sep='\t', header=None,
                    dtype=int, na_filter=False).values
    xtest = pd.read_csv(path + 'ctestX.txt', sep='\t', header=None,
                        dtype=int, na_filter=False).values
    # 简单的数据清洗
    # 除去重复列
    # xtest = xtest.T[~x.T.duplicated()].T
    # x = x.T.drop_duplicates().T
    # # 异常值处理
    # # ql, qu为下、上四分位数, iqr为四分位数间距
    # x_desc = x.describe()
    # ql, qu = x_desc.iloc[4, :], x_desc.iloc[6, :]
    # iqr = qu - ql
    # # 设置上下限
    # upper = qu + 1.5 * iqr
    # lower = ql - 1.5 * iqr
    # # 异常值置于四分位点
    # tmp1, tmp2 = x - upper, xtest - upper
    # x[tmp1 > 0] = x[tmp1 > 0] - tmp1[tmp1 > 0]
    # xtest[tmp2 > 0] = xtest[tmp2 > 0] - tmp2[tmp2 > 0]
    # tmp1, tm2 = lower - x, lower - xtest
    # x[tmp1 > 0] = x[tmp1 > 0] + tmp1[tmp1 > 0]
    # xtest[tmp2 > 0] = xtest[tmp2 > 0] + tmp2[tmp2 > 0]
    # x, xtest = x.astype(int), xtest.astype(int)
    # xtest = xtest.T[~x.T.duplicated()].T.values
    # x = x.T.drop_duplicates().T.values
    print('{}{}'.format(x.shape, xtest.shape))
    # 删除列全部为0的冗余特征，列中只出现少于5个非零值（约0.5%）的噪声特征
    # 删除方差差小于0.1的冗余特征
    x_not_zero = x > 0
    sum_not_zero = sum(x_not_zero)
    noise_cols = np.argwhere(sum_not_zero < 5)
    var = x.var(0)
    noise_cols = np.append(noise_cols, np.argwhere(var < 0.1))
    index = list(set(i for i in range(x.shape[1])).difference(
        noise_cols.reshape(-1)
    ))
    x, xtest = x[:, index], xtest[:, index]
    print(x.shape)
    # # 归一化数据，归一化后效果反而变差
    # x_max, x_min, x_mean = x.max(0), x.min(0), x.mean(0)
    # x = (x - x_mean) / (x_max - x_min)
    # xtest = (xtest - x_mean) / (x_max - x_min) * 2
    # 制作训练集
    writer = tf.python_io.TFRecordWriter(path + "train.tfrecords")
    for index in range(y.shape[0]):
        label = tf.train.Int64List(value=y[index])
        genes = tf.train.Int64List(value=x[index])
        feature = {
            "label": tf.train.Feature(int64_list=label),
            "genes": tf.train.Feature(int64_list=genes)
        }
        example = tf.train.Example(
            features=tf.train.Features(feature=feature)
        )
        writer.write(example.SerializeToString())
    writer.close()

    # 制作测试集
    writer = tf.python_io.TFRecordWriter(path + "test.tfrecords")
    for each in xtest:
        genes = tf.train.Int64List(value=list(each))
        feature = {
            "genes": tf.train.Feature(int64_list=genes)
        }
        example = tf.train.Example(
            features=tf.train.Features(feature=feature)
        )
        writer.write(example.SerializeToString())
    writer.close()


def preprocessing(cwd, num_classes, n_components=None):
    """
    :param cwd: 当前工作路径
    :param num_classes: 数据集的类别数，用于选择数据集
    :param n_components: 降维后的维度，默认None，表示不降维
    """
    data_train = "{}/{}classes/{}train.tfrecords".format(cwd, num_classes, num_classes)
    data_test = "{}/{}classes/{}test.tfrecords".format(cwd, num_classes, num_classes)
    x, y, x_test = [], [], []
    for serialized_example in tf.python_io.tf_record_iterator(data_train):
        example = tf.train.Example()
        example.ParseFromString(serialized_example)
        y.append(example.features.feature['label'].int64_list.value)
        x.append(example.features.feature['genes'].int64_list.value)
    for serialized_example in tf.python_io.tf_record_iterator(data_test):
        example = tf.train.Example()
        example.ParseFromString(serialized_example)
        x_test.append(example.features.feature['genes'].int64_list.value)
    y = np.array(y)
    x = np.array(x)
    x_test = np.array(x_test)
    if n_components:
        pca = PCA(n_components, whiten=True)
        pca.fit(x)
        x = pca.transform(x)
        x_test = pca.transform(x_test)
        # 计算信息量
        s = 0
        for each in pca.explained_variance_ratio_:
            s += each
        print(s)
    return x, y, x_test

    # # 筛选输出特征维度
    # x = np.array(x)
    # x_train, y_train, x_test, y_test = divide_sets(x, y, need_validate=False)
    # y_train[y_train == 2] = -1
    # y_test[y_test == 2] = -1
    # acc = []
    # # 50~70
    # for i in range(50, 70):
    #     pca = PCA(i, whiten=True)
    #     pca.fit(x_train)
    #     x = pca.transform(x_train)
    #     x_ = pca.transform(x_test)
    #     clf = svm.LinearSVC()
    #     clf.fit(x, y_train)
    #     pred = clf.predict(x_)
    #     accuracy = accuracy_score(y_test, pred)
    #     acc.append(accuracy)
    #     print("Acc of LinearSVC is {:.4f}%".format(accuracy * 100))
    # 计算信息量
    # s, cnt = 0, 0
    # for each in pca.explained_variance_ratio_:
    #     s += each
    #     cnt += 1
    #     10c: cnt=13  -> 0.90
    #      2c: cnt=245 -> 0.90
    #      2c: cnt=624 -> 0.95
    #     if s > 0.90:
    #         break
    # print(s)
    # print(cnt)
    # return acc


def divide_sets(x_sets, y_sets, need_validate=True):
    """
    :param x_sets: 基因数据集
    :param y_sets: 标签数据集
    :param need_validate: 是否产生验证集，在sklearn.svm中不需要手动产生验证集
    :return: 训练集、（验证集、）测试集
    """
    # 训练神经网络等需要自己制作验证集
    # 使用svm模块不需要自己制作验证集
    if need_validate:
        # 随机抽取测试集和验证集
        length = round(y_sets.shape[0] * 0.1)
        random_list = random.sample(
            range(y_sets.shape[0]), length * 2
        )
        validate_index, test_index = random_list[:length], random_list[length:]
        train_index = list(
            set(i for i in range(y_sets.shape[0])).difference(set(random_list))
        )
        x_train, y_train = x_sets[train_index], y_sets[train_index]
        x_validate, y_validate = x_sets[validate_index], y_sets[validate_index]
        x_test, y_test = x_sets[test_index], y_sets[test_index]
        return x_train, y_train, x_validate, y_validate, x_test, y_test
    else:
        # 随机抽取测试集
        length = round(y_sets.shape[0] * 0.2)
        test_index = random.sample(
            range(y_sets.shape[0]), length
        )
        train_index = list(
            set(i for i in range(y_sets.shape[0])).difference(set(test_index))
        )
        x_train, y_train = x_sets[train_index], y_sets[train_index]
        x_test, y_test = x_sets[test_index], y_sets[test_index]
        return x_train, y_train, x_test, y_test

make_record.py

# -*- coding:utf-8 -*-
import os
import records

# 读取当前路径
cwd = os.getcwd()
# 制造record
records.create_record(cwd, 2)
records.create_record(cwd, 10)

classification2.py

# -*- coding:utf-8 -*-
import os
import numpy as np
from sklearn import svm
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score, confusion_matrix
import records
import tensorflow.contrib.slim as slim
import tensorflow as tf
import random
import pandas as pd


def train(x, y, show_confusion=False):
    # 拆分训练集和测试集
    x_train, y_train, x_test, y_test = records.divide_sets(x, y, need_validate=False)
    # y_train1, y_test1 = y_train.copy(), y_test.copy()
    y_train[y_train == 2], y_test[y_test == 2] = -1, -1
    y_test, y_train = y_test.reshape(-1), y_train.reshape(-1)
    # y_train1[y_train1 == 2], y_test1[y_test1 == 2] = 0, 0
    # y_test1, y_train1 = y_test1.reshape(-1), y_train1.reshape(-1)
    # y_train2, y_test2 = np.eye(2)[y_train1], np.eye(2)[y_test1]
    # # 训练线性SVM
    # clf = svm.LinearSVC()
    # clf.fit(x_train, y_train)
    # pred = clf.predict(x_test)
    # accuracy = accuracy_score(y_test, pred)
    # print("Acc of LinearSVC is {:.4f}%".format(accuracy * 100))
    # 十倍交叉验证
    kf = KFold(n_splits=10)
    preds = []
    # fc_preds =[]
    for train_index, validate_index in kf.split(y_train):
        clf = svm.LinearSVC()
        clf.fit(x_train[train_index], y_train[train_index])
        # pred = clf.predict(x_train[validate_index])
        # accuracy = accuracy_score(y_train[validate_index], pred)
        # print("Acc of LinearSVC in validate sets is {:.4f}%".format(accuracy * 100))
        pred = clf.predict(x_test)
        # fc_pred = fc_net(x_train, y_train2, x_test, y_test2, in_dim=51)
        # print(accuracy_score(y_test, fc_pred))
        # fc_preds.append(fc_pred)
        # accuracy = accuracy_score(y_test, pred)
        # print("Acc of LinearSVC in test sets is {:.4f}%".format(accuracy * 100))
        preds.append(pred)
    pred = np.mean(preds, 0)
    classified = np.array(list(x_test[pred == 1]) + list(x_test[pred == -1]))
    x_train_new = np.array(list(x_train) + list(classified))
    y_train_new = np.append(np.append(y_train, pred[pred == 1]), pred[pred == -1])

    clf = svm.LinearSVC()
    clf.fit(x_train_new, y_train_new.reshape((-1, 1)))

    for i in range(len(pred)):
        if pred[i] != 1. and pred[i] != -1.:
            pred[i] = clf.predict(x_test[i].reshape((1, -1)))[0]
    # 神经网络不如SVM
    # fc_pred = np.sum(fc_preds, 0)
    # fc_pred[fc_pred > 0], fc_pred[fc_pred <= 0] = 1, -1
    accuracy = accuracy_score(y_test, pred)
    # accuracy1 = accuracy_score(y_test, fc_pred)
    # print('mean cnn {:.4f}%'.format(np.mean(fc_preds) * 100))
    print("Acc of all-SVC-vote in test sets is {:.4f}%".format(accuracy * 100))
    if show_confusion:
        print(confusion_matrix(y_test, pred))
    # print("Acc of fc-net in test sets is {:.4f}%".format(accuracy1 * 100))
    # # 多项式和径向基均不如线性
    # clf1 = svm.SVC(C=1, kernel='poly', degree=2, gamma=0.0019, coef0=22)
    # clf1 = svm.SVC(C=0.5, kernel='poly', degree=3, gamma=0.0020, coef0=22)
    # clf1 = svm.SVC(C=12, kernel='rbf', gamma=0.0075)
    # clf1.fit(x_train, y_train)
    # pred1 = clf1.predict(x_test)
    # accuracy1 = accuracy_score(y_test, pred1)
    # print("Acc of RBFSVC is {:.4f}%".format(accuracy1 * 100))
    # return accuracy, accuracy1


def test(x_train, y_train, x_test):
    y_train[y_train == 2] = -1
    y_train = y_train.reshape(-1)
    # 10Fold
    kf = KFold(n_splits=10)
    preds = []
    for train_index, _ in kf.split(y_train):
        clf = svm.LinearSVC()
        clf.fit(x_train[train_index], y_train[train_index])
        pred = clf.predict(x_test)
        preds.append(pred)
    pred = np.mean(preds, 0)
    classified = np.array(list(x_test[pred == 1]) + list(x_test[pred == -1]))
    x_train_new = np.array(list(x_train) + list(classified))
    y_train_new = np.append(np.append(y_train, pred[pred == 1]), pred[pred == -1])

    clf = svm.LinearSVC()
    clf.fit(x_train_new, y_train_new.reshape((-1, 1)))

    for i in range(len(pred)):
        if pred[i] != 1. and pred[i] != -1.:
            pred[i] = clf.predict(x_test[i].reshape((1, -1)))[0]
    pred[pred == -1] = 2
    pred_data_frame = pd.DataFrame(pred.astype(int))
    pred_data_frame.to_csv(
        '2classes\\2ctestY.txt',
        header=None,
        index=None,
        encoding='utf-8'
    )


def knn(sets, data, k):
    d = np.matmul(sets, data)
    return d.argsort()[:k]


def fc_net(x_train, y_train, x_test, in_dim=100):
    x = tf.placeholder(tf.float32, [None, in_dim], name='x-input')
    y_ = tf.placeholder(tf.float32, [None, 2], name='y-input')
    with slim.arg_scope(
            [slim.fully_connected],
            weights_regularizer=slim.l2_regularizer(1e-4),
            weights_initializer=tf.truncated_normal_initializer(stddev=0.1),
            biases_initializer=tf.constant_initializer(0.1),
            activation_fn=tf.sigmoid
    ):
        out = slim.fully_connected(x, 200)
        y = slim.fully_connected(out, 2)
    loss = slim.losses.softmax_cross_entropy(y, y_)
    accuracy = tf.reduce_mean(
        tf.cast(tf.equal(tf.argmax(y, 1), tf.argmax(y_, 1)), tf.float32)
    )
    global_step = tf.Variable(0, trainable=False)
    learning_rate = tf.train.exponential_decay(0.1, global_step, 128, 0.99)

    optimizer = tf.train.GradientDescentOptimizer(learning_rate)
    train_step = optimizer.minimize(loss, global_step=global_step)
    with tf.Session() as sess:
        tf.global_variables_initializer().run()
        x_t, y_t, x_v, y_v = records.divide_sets(x_train, y_train, need_validate=False)
        for i in range(1, 5001):
            # x_batch, y_batch = next_batch(x_train, y_train, 128)
            x_batch, y_batch = next_batch(x_t, y_t, 128)
            train_dict = {x: x_batch, y_: y_batch}
            validate_dict = {x: x_v, y_: y_v}
            sess.run(train_step, feed_dict=train_dict)
            if i % 100 == 0:
                los, acc, steps = sess.run(
                    [loss, accuracy, global_step], feed_dict=validate_dict
                )
                print('step:{}, loss:{}, acc:{}'.format(steps, los, acc))
        test_dict = {x: x_test, y_: np.zeros((x_test.shape[0], 2))}
        return sess.run(tf.argmax(y, 1) * 2 - 1, feed_dict=test_dict)


def next_batch(x, y, batch_size=64):
    random_list = random.sample(
        range(y.shape[0]), batch_size
    )
    return x[random_list], y[random_list]


def main():
    cwd = os.getcwd()
    x, y, x_test = records.preprocessing(cwd, 2)
    train(x, y, show_confusion=True)
    test(x, y, x_test)


if __name__ == "__main__":
    main()

classification10.py

# -*- coding:utf-8 -*-
import os
import numpy as np
from sklearn import svm
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score, confusion_matrix
import records
import pandas as pd


def train(x, y, show_confusion=False):
    # 拆分训练集和测试集
    x_train, y_train, x_test, y_test = records.divide_sets(x, y, need_validate=False)
    y_test, y_train = y_test.reshape(-1), y_train.reshape(-1)
    # 十倍交叉验证
    kf = KFold(n_splits=10)
    preds = []
    for train_index, validate_index in kf.split(y_train):
        clf = svm.LinearSVC()
        clf.fit(x_train[train_index], y_train[train_index])
        pred = clf.predict(x_train[validate_index])
        accuracy = accuracy_score(y_train[validate_index], pred)
        print("Acc of LinearSVC in validate sets is {:.4f}%".format(accuracy * 100))
        pred = clf.predict(x_test)
        preds.append(pred)
    preds = np.array(preds)
    var = preds.var(0)
    x_train_new = np.array(list(x_train) + list(x_test[var == 0]))
    y_train_new = np.append(y_train, preds[0][var == 0])

    clf = svm.LinearSVC()
    clf.fit(x_train_new, y_train_new.reshape((-1, 1)))

    pred = []
    for i in range(var.shape[0]):
        if var[i] != 0:
            pred.append(clf.predict(x_test[i].reshape((1, -1))))
        else:
            pred.append(preds[0, i])

    accuracy = accuracy_score(y_test, pred)
    print("Acc of all-SVC-vote in test sets is {:.4f}%".format(accuracy * 100))
    if show_confusion:
        print(confusion_matrix(y_test, pred))


def test(x_train, y_train, x_test):
    y_train = y_train.reshape(-1)
    # 10Fold
    kf = KFold(n_splits=10)
    preds = []
    for train_index, validate_index in kf.split(y_train):
        clf = svm.LinearSVC()
        clf.fit(x_train[train_index], y_train[train_index])
        pred = clf.predict(x_test)
        preds.append(pred)
    preds = np.array(preds)
    var = preds.var(0)
    x_train_new = np.array(list(x_train) + list(x_test[var == 0]))
    y_train_new = np.append(y_train, preds[0][var == 0])

    clf = svm.LinearSVC()
    clf.fit(x_train_new, y_train_new.reshape((-1, 1)))

    pred = []
    for i in range(var.shape[0]):
        if var[i] != 0:
            pred.append(int(clf.predict(x_test[i].reshape((1, -1)))))
        else:
            pred.append(int(preds[0, i]))

    pred_data_frame = pd.DataFrame(pred)
    pred_data_frame.to_csv(
        '10classes\\10ctestY.txt',
        header=None,
        index=None,
        encoding='utf-8'
    )


def main():
    cwd = os.getcwd()
    x, y, x_test = records.preprocessing(cwd, 10)
    train(x, y, show_confusion=True)
    test(x, y, x_test)


if __name__ == "__main__":
    main()

clustering.py

# -*- coding:utf-8 -*-
import os
from sklearn.manifold import TSNE
import records
import matplotlib.pyplot as plt
import time
from mpl_toolkits.mplot3d import Axes3D
from sklearn.cluster import KMeans
from scipy.stats import mode


def main():
    cwd = os.getcwd()
    time1 = time.time()
    x, y, _ = records.preprocessing(cwd, 10, n_components=50)
    y = y.reshape(-1)
    tsne = TSNE(n_components=3)
    x = tsne.fit_transform(x)
    time2 = time.time()
    print(time2 - time1)
    color = ['r', 'b', 'g', 'y', 'k', 'c', 'gold', 'purple', 'm', 'pink']
    ax = plt.subplot(111, projection='3d')
    for i in range(10):
        x_i = x[y == i + 1]
        ax.scatter(x_i[:, 0], x_i[:, 1], x_i[:, 2], color[i])
    ax.grid()
    plt.show()

    clf = KMeans(n_clusters=10)
    s = clf.fit(x)
    correct_nums = 0
    for i in range(10):
        m = mode(y[s.labels_ == i])
        correct_nums += m.count[0]
        print(m.mode[0])
    print("Acc of KMeans is {:.4f}".format(correct_nums / y.shape[0] * 100))
    time3 = time.time()
    print(time3 - time2)


if __name__ == "__main__":
    main()

README.txt

# 最终结果保存在result文件夹下，其中"2ctestY.txt"和"10ctestY.txt"分别为二分类和十分类的预测结果。
# 源文件保存在src文件夹下。

# 环境：Win10 + Anaconda3 + Tensorflow1.8
###################################
#   运行说明
0. records.py里编写了数据预处理和PCA降维、制作和读取.tfrecords文件、抽取训练集验证集测试集等函数。
1. 将两类数据的训练集和测试集均放入2classes文件夹中，十分类数据则放入10classes文件夹中。
2. 运行make_record.py文件（必须!），将原始数据进行初步筛选并存为程序读取更快的.tfrecords格式。
3. 运行classification2.py执行二分类任务。
4. 运行classification10.py执行十分类任务。
5. 运行clustering.py执行降维可视化、聚类任务。
###################################