Machine Learning: Implementing Linear Regression in Python


TigerTai98 · Published 2017-06-09 12:14:16

1. Principles

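Steps of linear regression:
1. Load the data.
2. Split the data into a training set and a test set
(for linear regression: x_train, x_test, y_train, y_test).
3. Import the linear regression algorithm
and use the training set to compute the model parameters.
4. Validate the model:
use the test set to measure the difference between the true and predicted values
(compute y_predict from x_test, compare it with y_test, and compute the error).
5. Print the results.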
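
The five steps above can be sketched end to end on synthetic data. This is only a minimal illustration: the data (`y = 2 + 3x` plus noise), the shuffled 80/20 split, and the use of `np.polyfit` as the fitting routine are assumptions made here, not the implementation developed later in this post.

```python
import numpy as np

# 1. Load the data (synthetic here, an assumption for illustration):
#    y = 2 + 3x plus a little Gaussian noise.
rng = np.random.default_rng(42)
x = np.linspace(0.0, 10.0, 100)
y = 2.0 + 3.0 * x + rng.normal(scale=0.1, size=x.size)

# 2. Split into training and test sets (shuffle, then hold out 20 samples).
order = rng.permutation(x.size)
train_idx, test_idx = order[:-20], order[-20:]
x_train, x_test = x[train_idx], x[test_idx]
y_train, y_test = y[train_idx], y[test_idx]

# 3. Fit on the training set; a degree-1 polyfit is a least-squares line.
slope, intercept = np.polyfit(x_train, y_train, 1)

# 4. Validate on the test set: predict, then compare with the true values.
y_predict = slope * x_test + intercept
mse = np.mean((y_test - y_predict) ** 2)

# 5. Print the results.
print("slope:", slope, "intercept:", intercept, "test MSE:", mse)
```

The fitted slope and intercept should land close to the values used to generate the data, and the test error should be on the order of the noise variance.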

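$h_\theta(x)$ is the variable to be predicted (the loan limit in the bank-loan example).
$\theta$ holds the parameters (the weight with which each input variable influences the result).
$x$ holds the input variables (salary and age in the example).
Note: both $x$ and $\theta$ are vectors, and the prediction is $h_\theta(x) = \theta^T x$.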
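
As a small illustration of the hypothesis $h_\theta(x) = \theta^T x$, the sketch below evaluates it for one sample. The parameter values and the feature values are made up for the example; the constant 1 prepended to the features is the usual trick that lets `theta[0]` act as the intercept.

```python
import numpy as np

# Hypothetical parameters: intercept, weight for salary, weight for age.
theta = np.array([1.0, 0.5, 2.0])

# One applicant with made-up, unitless values: salary = 4.0, age = 3.0.
features = np.array([4.0, 3.0])

# Prepend the constant 1 so that theta[0] acts as the intercept.
x = np.insert(features, 0, 1.0)

# h_theta(x) = theta^T x = 1*1 + 0.5*4 + 2*3
h = theta.dot(x)
print(h)
```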

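First we need some training data to tune the parameters, that is, to determine the value of $\theta$.
How do we determine it?
Introduce an error term $\varepsilon^{(i)}$ for each sample $i$, so that $y^{(i)} = \theta^T x^{(i)} + \varepsilon^{(i)}$.
Analysis (the errors are assumed to be independent and identically distributed, following a Gaussian distribution):
1. Independent and identically distributed: each person's salary and age are independent of everyone else's, while the bank applies the same lending criteria to all of them.
2. Gaussian distribution: $\varepsilon^{(i)}$ is usually not large, it is distributed symmetrically about 0, and the closer $\varepsilon^{(i)}$ is to 0, the higher its probability.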
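
Written out, the Gaussian assumption reads as follows. The standard deviation $\sigma$ is a symbol introduced here for illustration; the text above does not name it.

$$
p\left(\varepsilon^{(i)}\right) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{\left(\varepsilon^{(i)}\right)^2}{2\sigma^2}\right)
$$

The density peaks at $\varepsilon^{(i)} = 0$ and is symmetric about 0, which matches the two properties listed in point 2.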

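$\varepsilon^{(i)} = y^{(i)} - \theta^T x^{(i)}$
Every sample $i$ influences the result, and because the samples are independent, their joint probability is the product of the per-sample probabilities.
The smaller $\varepsilon^{(i)}$ is, the closer the fit and the larger the probability, so in the end we need to maximize $L(\theta)$.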
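
Under the usual assumption of zero-mean Gaussian errors, the likelihood can be written as below. Here $m$ (the number of samples) and $\sigma$ (the standard deviation of the error) are symbols introduced for this sketch.

$$
L(\theta) = \prod_{i=1}^{m} p\left(y^{(i)} \mid x^{(i)}; \theta\right)
= \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{\left(y^{(i)} - \theta^T x^{(i)}\right)^2}{2\sigma^2}\right)
$$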

A product is awkward to work with, so take the logarithm.
Maximizing the log-likelihood then turns into minimizing $J(\theta)$.
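
A sketch of this step, assuming a Gaussian likelihood with standard deviation $\sigma$ over $m$ samples (both symbols are introduced here for illustration):

$$
\log L(\theta) = m \log\frac{1}{\sqrt{2\pi}\,\sigma} - \frac{1}{\sigma^2}\cdot\frac{1}{2}\sum_{i=1}^{m}\left(y^{(i)} - \theta^T x^{(i)}\right)^2
$$

The first term does not depend on $\theta$, so maximizing $\log L(\theta)$ is equivalent to minimizing

$$
J(\theta) = \frac{1}{2}\sum_{i=1}^{m}\left(y^{(i)} - \theta^T x^{(i)}\right)^2
$$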

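Take the gradient and set it to zero to find the minimum of $J(\theta)$.
Explanation of the derivation:
1. $X\theta - Y$ is a column vector. A sum of squares can be written as the vector's transpose multiplied by the vector itself.
2. When $A$ is a symmetric matrix, $\nabla_\theta(\theta^T A \theta) = 2A\theta$.
3. Expressed in Python, the final result is: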

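import numpy as np
# Small illustrative data: a bias column plus one feature, with targets y = 1 + 2x
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
Y = np.array([1.0, 3.0, 5.0])
# np.linalg.inv is numpy's matrix inverse function
X_ = np.linalg.inv(X.T.dot(X))
# X.T is the transpose; X.dot(Y) is matrix multiplication
# theta = (X^T X)^{-1} X^T Y
theta = X_.dot(X.T).dot(Y)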
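
For reference, the closed-form expression follows from the two facts listed above. This is the standard derivation, sketched under the assumption that $X^T X$ is invertible:

$$
J(\theta) = \frac{1}{2}(X\theta - Y)^T(X\theta - Y)
= \frac{1}{2}\left(\theta^T X^T X\,\theta - 2\,Y^T X\,\theta + Y^T Y\right)
$$

$$
\nabla_\theta J(\theta) = X^T X\,\theta - X^T Y = 0
\quad\Longrightarrow\quad
\theta = \left(X^T X\right)^{-1} X^T Y
$$

The gradient of the quadratic term uses fact 2 with $A = X^T X$, which is symmetric.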

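
One way to convince yourself that the closed-form solution really is the minimum is to check numerically that the gradient $X^T(X\theta - Y)$ vanishes there. The sketch below does this on synthetic data; the data, the noise level, and the helper name `J` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): 50 samples, 2 features.
features = rng.normal(size=(50, 2))
X = np.insert(features, 0, 1, axis=1)          # prepend the bias column
true_theta = np.array([1.0, 2.0, -3.0])
Y = X.dot(true_theta) + 0.1 * rng.normal(size=50)

# Closed-form solution: theta = (X^T X)^{-1} X^T Y
theta = np.linalg.inv(X.T.dot(X)).dot(X.T).dot(Y)

# Gradient of J(theta) = 1/2 * ||X theta - Y||^2, evaluated at the solution.
gradient = X.T.dot(X.dot(theta) - Y)

def J(t):
    """Half the sum of squared residuals for parameter vector t."""
    r = X.dot(t) - Y
    return 0.5 * r.dot(r)

print("gradient at solution:", gradient)
print("J(theta):", J(theta), "J(true_theta):", J(true_theta))
```

The gradient should be zero up to floating-point error, and `J(theta)` should not exceed `J(true_theta)`, because the fitted parameters minimize the cost on this particular noisy sample.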

2. Code implementation

The concrete implementation is:

import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets

class LinearRegression():
    def __init__(self):  # initialize the weight vector
        self.w = None

    def fit(self, X, y):  # fit on the training set
        X = np.insert(X, 0, 1, axis=1)  # prepend a column of ones for the bias term
        print(X.shape)
        X_ = np.linalg.inv(X.T.dot(X))  # closed-form solution (normal equation)
        self.w = X_.dot(X.T).dot(y)

    def predict(self, X):  # predict on the test set
        # h_theta(x) = theta^T x, computed for all rows at once as X.dot(w)
        # Insert constant ones for bias weights
        X = np.insert(X, 0, 1, axis=1)
        y_pred = X.dot(self.w)
        return y_pred

def mean_squared_error(y_true, y_pred):
    # Mean of the squared differences between true and predicted values
    mse = np.mean(np.power(y_true - y_pred, 2))
    return mse

def main():
    # Step 1: load the data
    # Load the diabetes dataset
    diabetes = datasets.load_diabetes()

    # Use only one feature
    X = diabetes.data[:, np.newaxis, 2]
    print(X.shape)

    # Step 2: split the data into a training set and a test set
    # Split the data into training/testing sets
    x_train, x_test = X[:-20], X[-20:]

    # Split the targets into training/testing sets
    y_train, y_test = diabetes.target[:-20], diabetes.target[-20:]

    # Step 3: use the linear regression class defined above
    clf = LinearRegression()
    clf.fit(x_train, y_train)  # train
    y_pred = clf.predict(x_test)  # predict on the test set

    # Step 4: compute the test error (using the helper function above)
    # Print the mean squared error
    print("Mean Squared Error:", mean_squared_error(y_test, y_pred))

    # Visualize the output with matplotlib
    # Plot the results
    plt.scatter(x_test[:, 0], y_test, color='black')  # scatter plot of the true values
    plt.plot(x_test[:, 0], y_pred, color='blue', linewidth=3)  # fitted line
    plt.show()

if __name__ == "__main__":
    main()
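
The `fit` method above inverts $X^T X$ explicitly, which fails when that matrix is singular (for example, when two feature columns are identical). As a possible alternative, the sketch below solves the same least-squares problem with `np.linalg.lstsq`. The function name `fit_least_squares` and the demo data are hypothetical, introduced only for this example.

```python
import numpy as np

def fit_least_squares(X, y):
    """Return weights [bias, w1, ..., wn] minimizing ||X_b w - y||^2."""
    X_b = np.insert(X, 0, 1, axis=1)           # prepend the bias column
    w, _, _, _ = np.linalg.lstsq(X_b, y, rcond=None)
    return w

# Made-up data lying exactly on y = 2 + 3x.
X_demo = np.array([[0.0], [1.0], [2.0], [3.0]])
y_demo = np.array([2.0, 5.0, 8.0, 11.0])
w_demo = fit_least_squares(X_demo, y_demo)
print("weights:", w_demo)

# Duplicating the feature column makes X^T X singular, so an explicit
# inverse is not available; lstsq still returns a least-squares solution.
X_dup = np.hstack([X_demo, X_demo])
w_dup = fit_least_squares(X_dup, y_demo)
pred_dup = np.insert(X_dup, 0, 1, axis=1).dot(w_dup)
print("predictions with duplicated feature:", pred_dup)
```

On well-conditioned data both approaches should give the same weights, so this is a robustness choice rather than a change to the model.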