《Web安全之机器学习入门》笔记：第七章 7.4朴素贝叶斯检测WebShell（一）

1.源码修改（1）报错UnicodeDecodeError: 'gbk' codec can't decode byte 0x9a in position 8: illegal multibyte sequenceLoad ../data/PHP-WEBSHELL/xiaoma/1148d726e3bdec6db65db30c08a75f80.phpTraceback (most recent c

mooyuan

544人浏览 · 2022-01-30 23:55:14

mooyuan · 2022-01-30 23:55:14 发布

1.源码修改

（1）报错

UnicodeDecodeError: 'gbk' codec can't decode byte 0x9a in position 8: illegal multibyte sequence

Load ../data/PHP-WEBSHELL/xiaoma/1148d726e3bdec6db65db30c08a75f80.php
Traceback (most recent call last):
......
  t=load_file(file_path)
  for line in f:
UnicodeDecodeError: 'gbk' codec can't decode byte 0x9a in position 8: illegal multibyte sequence

将代码改为

def load_file(file_path):
    t=""
    with open(file_path,encoding='utf-8') as f:
        for line in f:
            line=line.strip('\n')
            t+=line
    return t

（2）报错2：

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xbe in position 15: invalid start byte

Load ../data/PHP-WEBSHELL/xiaoma/6b2548e859dd00dbf9e11487597b2c06.php
Traceback (most recent call last): 
    t=load_file(file_path)
    for line in f:
  File "C:\ProgramData\Anaconda3\lib\codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xbe in position 15: invalid start byte

报这个错的话，将这个文件另存为，改为utf-8编码

2.数据集处理之黑白样本获取

本节使用的数据集是在互联网搜集到的黑样本，也就是各种大马和小马的集合。

打开小马的目录，可以看到有54个php后缀的小马文件

打开一个文件，可以看到内容为一句话木马

样本应包括黑样本和白羊吧，对于基于Webshell的文本特征进行WebShell的检测，上文提到本文采用在互联网上搜集到的Webshell作为黑样本，那么白样本则是采用当前最新的wordpress源码，如下所示为白样本

3.样本向量化

在本文中php后缀的文件为黑白样本，需要将其转换为向量的方式。将一个PHP文件作为一个字符串处理，以基于单词2-gram切割，遍历全部文件形成基于2-gram的词汇表。然后进一步将每个PHP文件向量化

webshell的的思路为，将php webshell文件按照单词分词后(正则r'\b\w+\b')，按照2-gram算法得到词集，从而得到文件每一行在该词集上的分布情况，得到特征向量；然后将正常的php文件也按照如上方法在如上词集上得到特征向量。

（1）何为N-gram与2-gram

N-gram是机器学习中NLP处理中的一个较为重要的语言模型，它的基本思想是将文本里面的内容按照字节进行大小为N的滑动窗口操作，形成了长度是N的字节片段序列。n-gram模型是指n个连续的单词组成的序列。N=1时称为unigram，N=2称为bigram，N=3称为trigram，以此类推。

该模型基于这样一种假设，第N个词的出现只与前面N-1个词相关，而与其它任何词都不相关，整句的概率就是各个词出现概率的乘积。这些概率可以通过直接从语料中统计N个词同时出现的次数得到。常用的是二元的Bi-Gram和三元的Tri-Gram。

（2）黑样本

代码如下：

    webshell_bigram_vectorizer = CountVectorizer(ngram_range=(2, 2), decode_error="ignore",
                                        token_pattern = r'\b\w+\b',min_df=1)
    webshell_files_list=load_files("../data/PHP-WEBSHELL/xiaoma/")
    x1=webshell_bigram_vectorizer.fit_transform(webshell_files_list).toarray()
    print(len(x1), x1[0])
    y1=[1]*len(x1)

打印feature

print(webshell_bigram_vectorizer.get_feature_names())

结果如下：

打印vocabulary

    vocabulary=webshell_bigram_vectorizer.vocabulary_

内容如下所示

（3）白样本：使用黑样本生成的词汇表，将白样本特征化

代码如下

    vocabulary=webshell_bigram_vectorizer.vocabulary_
    wp_bigram_vectorizer = CountVectorizer(ngram_range=(2, 2), 
decode_error="ignore", token_pattern = r'\b\w+\b',min_df=1,vocabulary=vocabulary)
    wp_files_list=load_files("../data/wordpress/")
    x2=wp_bigram_vectorizer.fit_transform(wp_files_list).toarray()
    print(len(x2), x2[0])
    y2=[0]*len(x2)

（4）构造训练集

代码如下

    x=np.concatenate((x1,x2))
    y=np.concatenate((y1, y2))

5.完整代码如下：

基本运行环境为python3，如下为修改过可以正常运行的源码

import os
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np
from sklearn import model_selection
from sklearn.naive_bayes import GaussianNB


def load_file(file_path):
    t=""
    with open(file_path, encoding='utf-8') as f:
        for line in f:
            line=line.strip('\n')
            t+=line
    return t


def load_files(path):
    files_list=[]
    for r, d, files in os.walk(path):
        for file in files:
            if file.endswith('.php'):
                file_path=path+file
                #print("Load %s" % file_path)
                t=load_file(file_path)
                files_list.append(t)
    return  files_list



if __name__ == '__main__':

    webshell_bigram_vectorizer = CountVectorizer(ngram_range=(2, 2), decode_error="ignore",token_pattern = r'\b\w+\b',min_df=1)
    webshell_files_list=load_files("../data/PHP-WEBSHELL/xiaoma/")
    x1=webshell_bigram_vectorizer.fit_transform(webshell_files_list).toarray()
    print(len(x1), x1[0])
    y1=[1]*len(x1)

    vocabulary=webshell_bigram_vectorizer.vocabulary_
    wp_bigram_vectorizer = CountVectorizer(ngram_range=(2, 2), 
decode_error="ignore", token_pattern = r'\b\w+\b',min_df=1,vocabulary=vocabulary)
    wp_files_list=load_files("../data/wordpress/")
    x2=wp_bigram_vectorizer.fit_transform(wp_files_list).toarray()
    print(len(x2), x2[0])
    y2=[0]*len(x2)
    x=np.concatenate((x1,x2))
    y=np.concatenate((y1, y2))

    clf = GaussianNB()
    # 使用三折交叉验证
    scores = model_selection.cross_val_score(clf, x, y, n_jobs=1, cv=3)
    print(scores)
    print(scores.mean())

6.运行结果（3折交叉验证）

[0.71153846 0.88235294 0.74509804]
0.7796631473102061

7.10折交叉验证结果

代码如下

    # 使用三折交叉验证
    scores = model_selection.cross_val_score(clf, x, y, n_jobs=1, cv=10)
    print(scores)
    print(scores.mean())

运行结果如下

[0.75       0.4375     0.625      0.6875     0.73333333 0.66666667
 0.73333333 0.53333333 0.46666667 0.53333333]
0.6166666666666666

亚马逊云科技技术品牌专区

更多推荐

企业物联网平台如何选择？

亚马逊云科技技术品牌专区

STM32节点移植lorawan协议连接腾讯云物联网开发平台（IoT Explorer）

STM32移植lorawan协议连接腾讯云物联网开发平台（IoT Explorer）前言前言在移植协议之前，先给大家科普一下Lora 和 lorawan 的区别。LoRa 是LPWAN通信技术中的一种，是美国Semtech公司采用和推广的一种基于扩频技术的超远距离无线传输方案。这一方案改变了以往关于传输距离与功耗的折衷考虑方式为用户提供一种简单的能实现远距离、长电池寿命、大容量的系统，进而扩...

亚马逊云科技技术品牌专区

从华为的MQTT到TdEngineRPC，解读物联网时代的分布式

今天中秋节，笔者首先祝各位读者们中秋快乐，之所以在今天这个团圆节来谈分布式的话题，就是要聊聊物联网是如何通过MQTT连接各类终端，如何通过RPC整合各种数据的。下面就通过代码+动图的方式来解读一下华为LiteOS的MQTT与TD的RPC。MQTT协议MQTT是一个客户机服务器发布/订阅消息传输协议。它重量轻、开放、简单、易于实现。这些特性使其非常适合在物联网的低带宽、...