This is my first machine learning project and the first time that I use ColumnTransformer. My aim is to perform two steps of data preprocessing, and use ColumnTransformer for each of them.
In the first step, I want to replace the missing values in my dataframe with the string 'missing_value' for some features, and the most frequent value for the remaining features. Therefore, I combine these two operations using ColumnTransformer and passing to it the corresponding columns of my dataframe.
In the second step, I want to use the just preprocessed data and apply OrdinalEncoder or OneHotEncoder depending on the features. For that I use again ColumnTransformer.
I then combine the two steps into a single pipeline.
I am using the Kaggle Houses Price dataset, I have scikit-learn version 0.20 and this is a simplified version of my code:
cat_columns_fill_miss = ['PoolQC', 'Alley']
cat_columns_fill_freq = ['Street', 'MSZoning', 'LandContour']
cat_columns_ord = ['Street', 'Alley', 'PoolQC']
ord_mapping = [['Pave', 'Grvl'], # Street
['missing_value', 'Pave', 'Grvl'], # Alley
['missing_value', 'Fa', 'TA', 'Gd', 'Ex'] # PoolQC
]
cat_columns_onehot = ['MSZoning', 'LandContour']
imputer_cat_pipeline = ColumnTransformer([
('imp_miss', SimpleImputer(strategy='constant'), cat_columns_fill_miss), # fill_value='missing_value' by default
('imp_freq', SimpleImputer(strategy='most_frequent'), cat_columns_fill_freq),
])
encoder_cat_pipeline = ColumnTransformer([
('ordinal', OrdinalEncoder(categories=ord_mapping), cat_columns_ord),
('pass_ord', OneHotEncoder(), cat_columns_onehot),
])
cat_pipeline = Pipeline([
('imp_cat', imputer_cat_pipeline),
('cat_encoder', encoder_cat_pipeline),
])
Unfortunately, when I apply it to housing_cat, the subset of my dataframe including only categorical features,
cat_pipeline.fit_transform(housing_cat)
I get the error:
AttributeError: 'numpy.ndarray' object has no attribute 'columns'
During handling of the above exception, another exception occurred:
...
ValueError: Specifying the columns using strings is only supported for pandas DataFrames
I have tried this simplified pipeline and it works properly:
new_cat_pipeline = Pipeline([ ('imp_cat', imputer_cat_pipeline), ('onehot', OneHotEncoder()), ])
However, if I try:
enc_one = ColumnTransformer([ ('onehot', OneHotEncoder(), cat_columns_onehot), ('pass_ord', 'passthrough', cat_columns_ord) ]) new_cat_pipeline = Pipeline([ ('imp_cat', imputer_cat_pipeline), ('onehot_encoder', enc_one), ])
I start to get the same error.
I suspect then that this error is related to the use of ColumnTransformer in the second step, but I do not actually understand where it comes from. The way I identify the columns in the second step is the same as in the first step, so it remains unclear to me why only in the second step I get the Attribute Error...
所有评论(0)