Python ValueError: The number of columns in this dataset is different from the one used to fit this transformer (when using the fit() method)
I created a simple pipeline with sklearn. I split the data with the following code:
X_train, X_test, y_train, y_test = train_test_split(
df.drop(['selling_price'], axis=1),
df['selling_price'],
test_size=0.1,
random_state=0)
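I set up my configuration (which variables to transform, etc.) and called pipeline.fit(X_train, y_train). When I score on the training data, for example with pipeline.score(X_train, y_train), it returns a score. However, when I pass anything else into the pipeline, such as pipeline.score(X_test, y_test) or even pipeline.score(X_train.head(10), y_train.head(10)), I get the following error: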
ValueError: The number of columns in this dataset is different from the one used to fit this transformer (when using the fit() method).
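To clarify: the training and test splits have exactly the same columns, in the same order and with the same data types. The row counts also match between X_train and y_train, and between X_test and y_test.
Full code (imports not included):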
# Load the dataset
df = pd.read_csv('car_prices.csv')
# Remove duplicates and NaN-values
df.drop_duplicates(inplace=True)
df.dropna(inplace=True)
# Convert selling price
df['selling_price'] = df['selling_price']/100
# Split data
X_train, X_test, y_train, y_test = train_test_split(
df.drop(['selling_price'], axis=1),
df['selling_price'],
test_size=0.1,
random_state=0
)
# Configuration
NAME_TO_BRAND = ['name']
NUMBER_OF_OWNERS = ['owner']
ENGINE_PROPERTIES = ['engine', 'max_power']
ONE_HOT_ENCODE = ['fuel', 'seller_type', 'transmission']
FEATURES = ['name', 'year', 'km_driven', 'fuel', 'transmission', 'owner', 'max_power', 'seats', 'seller_type', 'engine']
X_train = X_train[FEATURES]
X_test = X_test[FEATURES]
# Pipeline
pipeline = Pipeline([
# Transform variables
('transform_name_to_brand', pp.BrandTransformer(NAME_TO_BRAND)),
('transform_number_of_owners', pp.NumberOfOwnersTransformer(NUMBER_OF_OWNERS)),
('transform_engine_properties', pp.EnginePropertiesTransformer(ENGINE_PROPERTIES)),
# One hot encode categorical variables
('one_hot_encode', OneHotEncoder(variables=ONE_HOT_ENCODE)),
# Random Forest Regressor
('RFR', RandomForestRegressor(random_state=0)),
])
pipeline.fit(X_train, y_train)
# Evaluate model
pipeline.score(X_train, y_train) # returns 0.99
pipeline.score(X_test, y_test) # raises the ValueError shown above
pipeline.score(X_train.head(1), y_train.head(1)) # raises the ValueError shown above
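Problem solved. In my pipeline, the categorical features are one-hot encoded. My training set has 42 unique categories, so one-hot encoding produces 42 columns. My test set has only 27 unique categories, so one-hot encoding produces 27 columns. That mismatch in column count is what raised the ValueError.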
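The mismatch described above can be reproduced in a few lines. The following is only a minimal sketch in plain Python, not the code of the encoder used in the question: the function `one_hot_independent`, the class `FittedOneHot`, and the sample fuel values are all hypothetical and exist purely for illustration. It contrasts encoding each split from its own categories (column counts differ) with learning the categories once on the training data and reusing them (column counts stay fixed).

```python
# Illustrative sketch only: a tiny stand-in for a one-hot encoder,
# not the feature_engine or scikit-learn implementation.


def one_hot_independent(values):
    """Encode using the categories found in *this* data (the failure mode)."""
    categories = sorted(set(values))
    rows = [[int(v == c) for c in categories] for v in values]
    return rows, categories


class FittedOneHot:
    """Learn the categories once in fit() and reuse them in transform()."""

    def fit(self, values):
        self.categories_ = sorted(set(values))
        return self

    def transform(self, values):
        # A category not seen during fit() becomes an all-zero row,
        # so the number of output columns never changes.
        return [[int(v == c) for c in self.categories_] for v in values]


# Hypothetical sample values; the test split holds fewer categories.
train_fuel = ["Diesel", "Petrol", "CNG", "LPG", "Petrol"]
test_fuel = ["Petrol", "Diesel"]

# Encoding each split on its own gives different widths.
train_rows, train_cats = one_hot_independent(train_fuel)
test_rows, test_cats = one_hot_independent(test_fuel)
print(len(train_cats), len(test_cats))

# Fitting on the training split only keeps the width consistent.
encoder = FittedOneHot().fit(train_fuel)
train_fixed = encoder.transform(train_fuel)
test_fixed = encoder.transform(test_fuel)
print(len(train_fixed[0]), len(test_fixed[0]))
```

If the encoding step can be swapped, scikit-learn's `OneHotEncoder(handle_unknown="ignore")` follows the same fit-then-transform idea: categories are fixed at fit time and unseen ones are encoded as all zeros. Whether that is the right replacement here depends on how the custom transformers in the pipeline build their columns, which the question does not show.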