Python ValueError: The number of columns in this dataset is different from the one used to fit this transformer (when using the fit() method)

Question

I created a simple pipeline with sklearn. I split the data using the following code:

X_train, X_test, y_train, y_test = train_test_split(
    df.drop(['selling_price'], axis=1),
    df['selling_price'],
    test_size=0.1,
    random_state=0
)

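I set up my configuration (which variables to transform, and so on) and called pipeline.fit(X_train, y_train). When I predict or score on the full training set, for example with pipeline.score(X_train, y_train), it returns a score. However, when I pass any other data into the pipeline, for example pipeline.score(X_test, y_test) or even pipeline.score(X_train.head(10), y_train.head(10)), I get the following error: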

ValueError: The number of columns in this dataset is different from the one used to fit this transformer (when using the fit() method).

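To be clear: the train and test splits have exactly the same columns, in the same order and with the same data types. In addition, the number of rows is consistent between X_train and y_train, and between X_test and y_test.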

Full code (imports omitted):

# Load the dataset
df = pd.read_csv('car_prices.csv')

# Remove duplicates and NaN-values
df.drop_duplicates(inplace=True)
df.dropna(inplace=True)

# Convert selling price
df['selling_price'] = df['selling_price']/100

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    df.drop(['selling_price'], axis=1),
    df['selling_price'],
    test_size=0.1,
    random_state=0
)

# Configuration
NAME_TO_BRAND = ['name']
NUMBER_OF_OWNERS = ['owner']
ENGINE_PROPERTIES = ['engine', 'max_power']

ONE_HOT_ENCODE = ['fuel', 'seller_type', 'transmission']

FEATURES = ['name', 'year', 'km_driven', 'fuel', 'transmission', 'owner', 'max_power', 'seats', 'seller_type', 'engine']

X_train = X_train[FEATURES]
X_test = X_test[FEATURES]

# Pipeline
pipeline = Pipeline([
    
    # Transform variables
    ('transform_name_to_brand', pp.BrandTransformer(NAME_TO_BRAND)),
    ('transform_number_of_owners', pp.NumberOfOwnersTransformer(NUMBER_OF_OWNERS)),
    ('transform_engine_properties', pp.EnginePropertiesTransformer(ENGINE_PROPERTIES)),
    
    # One hot encode categorical variables
    ('one_hot_encode', OneHotEncoder(variables=ONE_HOT_ENCODE)),
    
    # Random Forest Regressor
    ('RFR', RandomForestRegressor(random_state=0)),
])

pipeline.fit(X_train, y_train)

# Evaluate model
pipeline.score(X_train, y_train) # returns 0.99
pipeline.score(X_test, y_test) # raises the ValueError shown above
pipeline.score(X_train.head(1), y_train.head(1)) # raises the ValueError shown above

Answer 1

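The problem is solved. In my pipeline, the categorical features are one-hot encoded. My training set has 42 unique categories, so one-hot encoding it produces 42 columns. My test set has only 27 unique categories, so one-hot encoding it produces 27 columns. Because the column counts differ, the ValueError is raised.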
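
The mismatch described above can be reproduced on a small scale. The following is only a minimal sketch under stated assumptions: the `brand` column and its values are made-up toy data, not the car dataset from the question, and it uses pandas `get_dummies` and scikit-learn's `OneHotEncoder`, which may not be the encoder the question's pipeline uses. It shows how encoding each split on its own gives different column counts, and how an encoder fitted once on the training data keeps the width fixed.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: the training split has three brands,
# the test split has two, one of which was never seen in training.
train = pd.DataFrame({"brand": ["Maruti", "Hyundai", "Tata", "Maruti"]})
test = pd.DataFrame({"brand": ["Maruti", "Honda"]})

# Encoding each split independently: one column per category
# that happens to be present in that split.
train_dummies = pd.get_dummies(train, columns=["brand"])
test_dummies = pd.get_dummies(test, columns=["brand"])
print(train_dummies.shape[1], test_dummies.shape[1])

# Fitting the encoder on the training split only and reusing it:
# the categories learned during fit determine the output width.
# handle_unknown="ignore" encodes unseen categories as all zeros
# instead of raising an error.
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train[["brand"]])
train_encoded = encoder.transform(train[["brand"]])
test_encoded = encoder.transform(test[["brand"]])
print(train_encoded.shape[1], test_encoded.shape[1])
```

In this sketch the independently encoded frames differ in width, while both outputs of the fitted encoder have the same number of columns, so a model trained on one can score the other.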

python

pandas

scikit-learn

valueerror