n features for data and the input test set are different after applying one hot encoding
I have been trying to train a RandomForestRegressor to predict house prices for a given test set, based on a given training set.
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import MaxAbsScaler
file='file:///F:/Download sort required/train.csv'
data=pd.read_csv(file)
data.dropna(axis=0,subset=['SalePrice'],inplace=True)
y=data.SalePrice
predictors=['LotArea','OverallQual','GrLivArea','GarageCars','TotRmsAbvGrd','Neighborhood','HouseStyle','YearBuilt','ExterQual','KitchenQual']
One_hot_encoded_predictors=['Neighborhood','HouseStyle','YearBuilt','ExterQual','KitchenQual']
X_uncoded=data[predictors]
#Encoding the training data
X_uncoded=pd.get_dummies(X_uncoded,columns=One_hot_encoded_predictors)
X=X_uncoded
maxabsscaler=MaxAbsScaler()
X_max_abs=maxabsscaler.fit_transform(X)
model=RandomForestRegressor()
model.fit(X_max_abs,y)
test_file='file:///C:/Users/shand/Downloads/test.csv'
test_data=pd.read_csv(test_file)
X_uncoded_test=test_data[predictors]
X_uncoded_test=pd.get_dummies(X_uncoded_test,columns=One_hot_encoded_predictors)
X_test=X_uncoded_test
X_test.fillna(X_test.mean(),inplace=True)
X_max_abs_test=maxabsscaler.fit_transform(X_test)
predicted_prices=model.predict(X_max_abs_test)
my_submission = pd.DataFrame({'Id': test_data.Id, 'SalePrice': predicted_prices})
my_submission.to_csv('submission.csv', index=False)
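I applied one-hot encoding to the categorical features and then a MaxAbsScaler transform, since most of the data varies between -1 and 1 or between 0 and 1. However, running the code throws the following error: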
> > 28 X_test.fillna(X_test.mean(),inplace=True)
> 29 X_max_abs_test=maxabsscaler.fit_transform(X_test)
> ---> 30 predicted_prices=model.predict(X_max_abs_test)
> 31 my_submission = pd.DataFrame({'Id': test_data.Id, 'SalePrice': predicted_prices})
> 32 my_submission.to_csv('submission.csv', index=False)
>
> C:\Users\shand\Anaconda3\lib\site-packages\sklearn\ensemble\forest.py
> in predict(self, X)
> 683 """
> 684 # Check data
> --> 685 X = self._validate_X_predict(X)
> 686
> 687 # Assign chunk of trees to jobs
>
> C:\Users\shand\Anaconda3\lib\site-packages\sklearn\ensemble\forest.py
> in _validate_X_predict(self, X)
> 353 "call `fit` before exploiting the model.")
> 354
> --> 355 return self.estimators_[0]._validate_X_predict(X, check_input=True)
> 356
> 357 @property
>
> C:\Users\shand\Anaconda3\lib\site-packages\sklearn\tree\tree.py in
> _validate_X_predict(self, X, check_input)
> 374 "match the input. Model n_features is %s and "
> 375 "input n_features is %s "
> --> 376 % (self.n_features_, n_features))
> 377
> 378 return X
>
> ValueError: Number of features of the model must match the input.
> Model n_features is 158 and input n_features is 151
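After applying one-hot encoding and MaxAbsScaler, 158 features are used to train the model.
Can anyone explain why I get this error even though I applied the same transformations to the training and test data?
What should I do to fix it?
PS: the data comes from
https://www.kaggle.com/c/house-prices-advanced-regression-techniques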
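As you mentioned, the training and test data have different numbers of columns after encoding.
The training data has 158 columns, while the test data has only 151.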
#Encoding the train data
X_uncoded=pd.get_dummies(X_uncoded,columns=One_hot_encoded_predictors)
X=X_uncoded
print(X.shape)
(1460, 158)
#Encoding the test data
X_uncoded_test=pd.get_dummies(X_uncoded_test,columns=One_hot_encoded_predictors)
print(X_uncoded_test.shape)
(1459, 151)
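To see which encoded columns account for the gap, one option is to compare the two sets of column names. The snippet below is only a sketch on made-up toy frames (the `train` and `test` frames and their values are illustrative, not the Kaggle data); the same set difference could be applied to `X.columns` and `X_uncoded_test.columns`.

```python
import pandas as pd

# Toy stand-ins for the real train/test frames (values are made up).
train = pd.DataFrame({"HouseStyle": ["1Story", "2Story", "SLvl", "2.5Fin"]})
test = pd.DataFrame({"HouseStyle": ["1Story", "2Story", "1Story"]})

train_encoded = pd.get_dummies(train, columns=["HouseStyle"])
test_encoded = pd.get_dummies(test, columns=["HouseStyle"])

# Columns created for the training data that never appear in the test data.
missing_in_test = sorted(set(train_encoded.columns) - set(test_encoded.columns))
# Columns created for the test data that never appear in the training data.
extra_in_test = sorted(set(test_encoded.columns) - set(train_encoded.columns))

print(train_encoded.shape, test_encoded.shape)
print("missing in test:", missing_in_test)
print("extra in test:", extra_in_test)
```

Note that the two counts can partly cancel out: the test set may lack some training levels and also contain levels the training set never saw, so both differences are worth checking.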
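This is probably because the test data has fewer levels than the training data: pd.get_dummies creates one column for each distinct value it finds in the data it is given. See the example below, from the pandas.get_dummies documentation: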
import pandas as pd
s = pd.Series(list('abca'))
pd.get_dummies(s)
a b c
0 1 0 0
1 0 1 0
2 0 0 1
3 1 0 0
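You could combine the training and test sets before encoding, and then split them back into training and test sets after encoding, as described here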
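A minimal sketch of the combine-then-split idea, on made-up toy frames (the frame contents and the `categorical` list are illustrative assumptions). It also shows a second option that is not part of the suggestion above: encoding the two sets separately and then aligning the test columns to the training columns with `DataFrame.reindex`.

```python
import pandas as pd

# Toy stand-ins for the real data (values are made up).
train = pd.DataFrame({"LotArea": [8450, 9600, 11250],
                      "ExterQual": ["Gd", "TA", "Ex"]})
test = pd.DataFrame({"LotArea": [11622, 14267],
                     "ExterQual": ["TA", "Fa"]})
categorical = ["ExterQual"]

# Option 1: encode train and test together, then split them apart again.
# The keys become an outer index level that marks where each row came from.
combined = pd.concat([train, test], keys=["train", "test"])
combined_encoded = pd.get_dummies(combined, columns=categorical)
X_train = combined_encoded.loc["train"]
X_test = combined_encoded.loc["test"]

# Option 2: encode separately, then force the test columns to match
# the training columns (missing columns are filled with 0, extras dropped).
X_train_alt = pd.get_dummies(train, columns=categorical)
X_test_alt = pd.get_dummies(test, columns=categorical).reindex(
    columns=X_train_alt.columns, fill_value=0)

print(X_train.shape, X_test.shape)
print(X_train_alt.shape, X_test_alt.shape)
```

With option 1, the encoded columns cover every level seen in either set. With option 2, a level that appears only in the test set (here `Fa`) is dropped, so that row ends up with all of its `ExterQual` dummy columns set to 0. Once the columns line up, it is probably also worth calling `maxabsscaler.transform` on the test set rather than `fit_transform`, so that the test data is scaled with the values learned from the training data.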