机器学习之基于xgboost的特征筛选

本文主要是基于xgboost进行特征选择，很多人都知道在后面的模型选择时，xgboost模型是一个非常热门的模型。但其实在前面特征选择部分，基于xgboost进行特征筛选也大有可为。#coding=utf-8import pandas as pdimport xgboost as xgbimport os,random,pickleos.mkdir('featurescore'...

语亦情非

4334人浏览 · 2020-03-19 01:15:43

语亦情非 · 2020-03-19 01:15:43 发布

本文主要是基于xgboost进行特征选择，很多人都知道在后面的模型选择时，xgboost模型是一个非常热门的模型。但其实在前面特征选择部分，基于xgboost进行特征筛选也大有可为。

#coding=utf-8

import pandas as pd
import xgboost as xgb
import os,random,pickle


os.mkdir('featurescore')



train = pd.read_csv('../../data/train/train_x_rank.csv')
train_target = pd.read_csv('../../data/train/train_master.csv',encoding='gb18030')[['Idx','target']]
train = pd.merge(train,train_target,on='Idx')
train_y = train.target
train_x = train.drop(['Idx','target'],axis=1)
dtrain = xgb.DMatrix(train_x, label=train_y)

test = pd.read_csv('../../data/test/test_x_rank.csv')
test_Idx = test.Idx
test = test.drop('Idx',axis=1)
dtest = xgb.DMatrix(test)


train_test = pd.concat([train,test])
train_test.to_csv('rank_feature.csv',index=None)
print print(train_test.shape)

"""
params={
    	'booster':'gbtree',
    	'objective': 'rank:pairwise',
    	'scale_pos_weight': float(len(train_y)-sum(train_y))/float(sum(train_y)),
        'eval_metric': 'auc',
    	'gamma':0.1,
    	'max_depth':6,
    	'lambda':500,
        'subsample':0.6,
        'colsample_bytree':0.3,
        'min_child_weight':0.2, 
        'eta': 0.04,
    	'seed':1024,
    	'nthread':8
        }
xgb.cv(params,dtrain,num_boost_round=1100,nfold=10,metrics='auc',show_progress=3,seed=1024)#733

"""

def pipeline(iteration,random_seed,gamma,max_depth,lambd,subsample,colsample_bytree,min_child_weight):
    params={
            'booster':'gbtree',
	    'objective': 'rank:pairwise',
	    'scale_pos_weight': float(len(train_y)-sum(train_y))/float(sum(train_y)),
	    'eval_metric': 'auc',
	    'gamma':gamma,
	    'max_depth':max_depth,
	    'lambda':lambd,
	    'subsample':subsample,
	    'colsample_bytree':colsample_bytree,
	    'min_child_weight':min_child_weight, 
	    'eta': 0.2,
	    'seed':random_seed,
	    'nthread':8
	 }

    watchlist  = [(dtrain,'train')]
    model = xgb.train(params,dtrain,num_boost_round=700,evals=watchlist)
    #model.save_model('./model/xgb{0}.model'.format(iteration))
    #predict test set
    #test_y = model.predict(dtest)
    #test_result = pd.DataFrame(test_Idx,columns=["Idx"])
    #test_result["score"] = test_y
    #test_result.to_csv("./preds/xgb{0}.csv".format(iteration),index=None,encoding='utf-8')
    
    #save feature score
    feature_score = model.get_fscore()
    feature_score = sorted(feature_score.items(), key=lambda x:x[1],reverse=True)
    fs = []
    for (key,value) in feature_score:
        fs.append("{0},{1}\n".format(key,value))
    
    with open('./featurescore/feature_score_{0}.csv'.format(iteration),'w') as f:
        f.writelines("feature,score\n")
        f.writelines(fs)


if __name__ == "__main__":
    random_seed = range(10000,20000,100)
    gamma = [i/1000.0 for i in range(0,300,3)]
    max_depth = [5,6,7]
    lambd = range(400,600,2)
    subsample = [i/1000.0 for i in range(500,700,2)]
    colsample_bytree = [i/1000.0 for i in range(550,750,4)]
    min_child_weight = [i/1000.0 for i in range(250,550,3)]
    
    random.shuffle(random_seed)
    random.shuffle(gamma)
    random.shuffle(max_depth)
    random.shuffle(lambd)
    random.shuffle(subsample)
    random.shuffle(colsample_bytree)
    random.shuffle(min_child_weight)
    
    with open('params.pkl','w') as f:
        pickle.dump((random_seed,gamma,max_depth,lambd,subsample,colsample_bytree,min_child_weight),f)

    for i in range(36):
        pipeline(i,random_seed[i],gamma[i],max_depth[i%3],lambd[i],subsample[i],colsample_bytree[i],min_child_weight[i])

因为xgboost的参数选择非常重要，因此进行了参数shuffle的操作。最后可以基于以上不同参数组合的xgboost所得到的feature和socre，再进行score平均操作，筛选出高得分的特征。

import pandas as pd 
import os


files = os.listdir('featurescore')
fs = {}
for f in files:
    t = pd.read_csv('featurescore/'+f)
    t.index = t.feature
    t = t.drop(['feature'],axis=1)
    d = t.to_dict()['score']
    for key in d:
        if fs.has_key(key):
            fs[key] += d[key]
        else:
            fs[key] = d[key] 
            
fs = sorted(fs.items(), key=lambda x:x[1],reverse=True)

t = []
for (key,value) in fs:
    t.append("{0},{1}\n".format(key,value))

with open('rank_feature_score.csv','w') as f:
    f.writelines("feature,score\n")
    f.writelines(t)

这里得出了每个特征的总分，每个都除以36就是平均分了。最后按照平均分取出topn就可以。我的理解是这样子。

然后觉得这种方法太耗时了。

亚马逊云科技技术品牌专区

更多推荐

企业物联网平台如何选择？

亚马逊云科技技术品牌专区

STM32节点移植lorawan协议连接腾讯云物联网开发平台（IoT Explorer）

STM32移植lorawan协议连接腾讯云物联网开发平台（IoT Explorer）前言前言在移植协议之前，先给大家科普一下Lora 和 lorawan 的区别。LoRa 是LPWAN通信技术中的一种，是美国Semtech公司采用和推广的一种基于扩频技术的超远距离无线传输方案。这一方案改变了以往关于传输距离与功耗的折衷考虑方式为用户提供一种简单的能实现远距离、长电池寿命、大容量的系统，进而扩...

亚马逊云科技技术品牌专区

从华为的MQTT到TdEngineRPC，解读物联网时代的分布式

今天中秋节，笔者首先祝各位读者们中秋快乐，之所以在今天这个团圆节来谈分布式的话题，就是要聊聊物联网是如何通过MQTT连接各类终端，如何通过RPC整合各种数据的。下面就通过代码+动图的方式来解读一下华为LiteOS的MQTT与TD的RPC。MQTT协议MQTT是一个客户机服务器发布/订阅消息传输协议。它重量轻、开放、简单、易于实现。这些特性使其非常适合在物联网的低带宽、...