KAGGLE ENSEMBLING GUIDE

http://mlwave.com/kaggle-ensembling-guide/

creating ensembles from submission files

  • no need to retrain a model
  • quick
  • reuses already existing model predictions
  • ideal when teaming up

voting ensembles

classification, measured with metrics.accuracy_score

error correcting
  • when each model's error rate is low, a majority vote corrects the occasional mistake (majority rules)
correlation
  • uncorrelated predictions carry more information
  • it works better to ensemble low-correlated model predictions
weighting
  • give a better model more weight
  • avoids a purely democratic average

the improvement is usually modest, on the order of 1%
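
A minimal sketch of a weighted majority vote over class-label submission files; the filenames, column names, and weights below are made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical submission files, each with an "id" and a predicted "label" column.
files = ["model_a.csv", "model_b.csv", "model_c.csv"]
weights = [3, 1, 1]  # give the (presumably) better first model more weight than a pure democracy

subs = [pd.read_csv(f) for f in files]
labels = np.stack([s["label"].values for s in subs], axis=1)  # shape (n_rows, n_models)

def weighted_vote(row):
    # each model casts "weight" votes; the label with the most votes wins
    counts = {}
    for label, weight in zip(row, weights):
        counts[label] = counts.get(label, 0) + weight
    return max(counts, key=counts.get)

ensemble = pd.DataFrame({
    "id": subs[0]["id"],
    "label": [weighted_vote(row) for row in labels],
})
ensemble.to_csv("vote_ensemble.csv", index=False)
```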

averaging

classification and regression, with metrics such as AUC, squared error, or logarithmic loss
also loosely called "bagging submissions"
- reduces overfitting
- averaging can reduce the impact of noise
- even a single, poorly cross-validated and overfitted submission may bring some gain by adding diversity (thus less correlation)
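
A sketch of plain averaging of the prediction columns from several submission files (again with hypothetical filenames and columns):

```python
import pandas as pd

# Hypothetical submission files with "id" and "pred" columns.
files = ["model_a.csv", "model_b.csv", "model_c.csv"]
subs = [pd.read_csv(f) for f in files]

avg = subs[0][["id"]].copy()
avg["pred"] = sum(s["pred"] for s in subs) / len(subs)  # arithmetic mean of the predictions
avg.to_csv("average_ensemble.csv", index=False)
```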

rank averaging
  • not all predictors are perfectly calibrated
  • plain averaging pulls the results toward each other and suppresses predictions that deviate strongly
  • convert the predictions to ranks first, then average, to restore the spread between them
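
One way to do rank averaging (a sketch with hypothetical filenames): replace each model's raw predictions by their ranks, average the ranks, and normalize back to [0, 1].

```python
import pandas as pd

# Hypothetical submission files with "id" and "pred" columns.
files = ["model_a.csv", "model_b.csv", "model_c.csv"]
subs = [pd.read_csv(f) for f in files]

# rank() replaces each model's scores with 1..n, so badly calibrated scales no longer dominate
mean_rank = sum(s["pred"].rank() for s in subs) / len(subs)

out = subs[0][["id"]].copy()
out["pred"] = (mean_rank - mean_rank.min()) / (mean_rank.max() - mean_rank.min())  # back to [0, 1]
out.to_csv("rank_average_ensemble.csv", index=False)
```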

stacked generalization

  • a pool of base classifiers
  • another classifier combines their predictions to reduce the generalization error

2-fold stacking:
1. Split the train set into 2 parts: train_a and train_b
2. Fit a first-stage model on train_a and create predictions for train_b
3. Fit the same model on train_b and create predictions for train_a
4. Finally, fit the model on the entire train set and create predictions for the test set
5. Now train a second-stage stacker model on the probabilities from the first-stage model(s)

A stacker model gets more information on the problem space by using the first-stage predictions as features than it would if trained in isolation.

the level 0 generalizers should “span the space”.

The more each generalizer has to say (which isn’t duplicated in what the other generalizer’s have to say), the better the resultant stacked generalization.

creating out-of-fold predictions for the train set
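
A minimal sketch of the recipe above with scikit-learn; the data and models are placeholders, cross_val_predict with cv=2 reproduces steps 1-3, and in practice you would column-stack the out-of-fold predictions of several base models:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# Placeholder data standing in for a competition's train and test sets.
X_train, y_train = make_classification(n_samples=1000, n_features=20, random_state=0)
X_test, _ = make_classification(n_samples=500, n_features=20, random_state=1)

base = RandomForestClassifier(n_estimators=200, random_state=0)

# Steps 1-3: out-of-fold probabilities for the train set (cv=2 matches the 2-fold recipe).
train_meta = cross_val_predict(base, X_train, y_train, cv=2, method="predict_proba")[:, 1]

# Step 4: refit on the entire train set and predict the test set.
base.fit(X_train, y_train)
test_meta = base.predict_proba(X_test)[:, 1]

# Step 5: train the second-stage stacker on the first-stage probabilities.
stacker = LogisticRegression()
stacker.fit(train_meta.reshape(-1, 1), y_train)
final_pred = stacker.predict_proba(test_meta.reshape(-1, 1))[:, 1]
```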

blending

very close to stacked generalization, but a bit simpler and with less risk of an information leak.

create a small holdout set of say 10% of the train set. The stacker model then trains on this holdout set only.

  • simpler
  • The generalizers and stackers use different data.
    • this wards against an information leak
  • blender decides if it wants to keep that model or not.

cons:
- uses less data overall
- the final model may overfit the holdout set
- the CV score is not as solid as with stacking (which is calculated over more folds)
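
A sketch of blending with a ~10% holdout, on placeholder data and models:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Placeholder data standing in for a competition's train and test sets.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_test, _ = make_classification(n_samples=500, n_features=20, random_state=1)

# Carve out a small (~10%) holdout that only the blender will ever see.
X_base, X_hold, y_base, y_hold = train_test_split(X, y, test_size=0.1, random_state=0)

bases = [
    RandomForestClassifier(n_estimators=200, random_state=0),
    KNeighborsClassifier(n_neighbors=15),
]

# Generalizers train on the 90% part, then predict the holdout and the test set.
hold_meta = np.column_stack([m.fit(X_base, y_base).predict_proba(X_hold)[:, 1] for m in bases])
test_meta = np.column_stack([m.predict_proba(X_test)[:, 1] for m in bases])

# The blender trains on the holdout predictions only.
blender = LogisticRegression()
blender.fit(hold_meta, y_hold)
final_pred = blender.predict_proba(test_meta)[:, 1]
```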

If you cannot choose, you can always do both: create stacked ensembles with stacked generalization and out-of-fold predictions, then use a holdout set to further combine these models at a third stage.

Stacking classifiers with regressors

  • turn the y-label into evenly spaced classes
    • the regression problem turns into a multiclass classification problem
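
One way to turn a continuous target into evenly spaced classes (a sketch; the target and bin count below are arbitrary):

```python
import numpy as np

# Placeholder continuous target.
y = np.random.RandomState(0).uniform(0.0, 100.0, size=1000)

n_bins = 20  # arbitrary choice
edges = np.linspace(y.min(), y.max(), n_bins + 1)
y_class = np.digitize(y, edges[1:-1])  # evenly spaced classes 0 .. n_bins - 1

# A classifier trained on y_class yields class probabilities that can be used as
# stacker features; the bin midpoints map a predicted class back to a number.
midpoints = (edges[:-1] + edges[1:]) / 2
```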

stack with unsupervised learning techniques
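
A sketch of feeding unsupervised structure into the stack: here K-Means distances to cluster centers are appended as extra features (an embedding such as t-SNE is another common choice); the data is a placeholder:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification

# Placeholder data.
X_train, y_train = make_classification(n_samples=1000, n_features=20, random_state=0)
X_test, _ = make_classification(n_samples=500, n_features=20, random_state=1)

# Fit the clustering on train only, then append distance-to-centroid features everywhere.
km = KMeans(n_clusters=8, n_init=10, random_state=0)
train_dist = km.fit_transform(X_train)  # shape (n_samples, n_clusters)
test_dist = km.transform(X_test)

X_train_aug = np.hstack([X_train, train_dist])
X_test_aug = np.hstack([X_test, test_dist])
# Any first-stage model can now train on the augmented feature matrix.
```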

online stacking

think of every action as a hyper-parameter for the stacker model
- scaling the data
- number of base models
- feature selection
- imputation
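
For example, such choices can be searched like any other hyper-parameter; a sketch with a scikit-learn pipeline and a made-up grid:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest()),
    ("model", LogisticRegression(max_iter=1000)),
])

# "Scale or not", "how many features", "how much regularization" are all just grid entries.
grid = {
    "scale": [StandardScaler(), "passthrough"],
    "select__k": [5, 10, 20],
    "model__C": [0.1, 1.0, 10.0],
}

search = GridSearchCV(pipe, grid, cv=5, scoring="roc_auc")
search.fit(X, y)
print(search.best_params_)
```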

Sometimes it is useful to allow XGBoost to see what a KNN-classifier sees.
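
A sketch of that idea: out-of-fold KNN probabilities become an extra feature for the booster (assumes the xgboost package is available; any gradient booster would work the same way):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsClassifier
from xgboost import XGBClassifier  # assumed available

# Placeholder data.
X_train, y_train = make_classification(n_samples=1000, n_features=20, random_state=0)
X_test, _ = make_classification(n_samples=500, n_features=20, random_state=1)

knn = KNeighborsClassifier(n_neighbors=25)

# Out-of-fold KNN probabilities for train, full-fit probabilities for test.
knn_train = cross_val_predict(knn, X_train, y_train, cv=5, method="predict_proba")[:, 1]
knn_test = knn.fit(X_train, y_train).predict_proba(X_test)[:, 1]

# The booster now "sees what the KNN sees" through the extra feature column.
booster = XGBClassifier(n_estimators=300, max_depth=4)
booster.fit(np.column_stack([X_train, knn_train]), y_train)
pred = booster.predict_proba(np.column_stack([X_test, knn_test]))[:, 1]
```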
