Why doesn't MinMaxScaler, applied only to certain columns, normalize my DataFrame?

Question

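I need to normalize the columns in my dataset, but skip certain columns that already have small values and a standard deviation below 1. All the columns I want to normalize are stored in the list columns_to_normalize. Running the following code still does not seem to normalize the data: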

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from pandas import DataFrame
# create scaler
minmax_transformer = Pipeline(steps=[
        ('minmax', MinMaxScaler())])

# perform normalization on the dataset, avoiding ordinal columns
preprocessor = ColumnTransformer(
        remainder='passthrough', 
        transformers=[
            ('mm', minmax_transformer , columns_to_normalize)
        ])

# fit and transform model on data
df_norm_values = preprocessor.fit_transform(df)

# convert the array back to a dataframe
df_norm = DataFrame(df_norm_values)

# set columns' names
column_names = list(df.columns)
df_norm.columns = column_names
# summary of the normalized input variables
df_norm.describe()

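For example, the last two columns are not fully normalized: their minimum values are 0.00 and 1.00, and their maximum values are 3.00 and 4.00. I don't understand why my code is not working.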

Answer 1

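The scaler is working correctly, but ColumnTransformer is changing the order of the columns, so the original column names end up attached to the wrong columns.

From the ColumnTransformer documentation: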

The order of the columns in the transformed feature matrix follows the order of how the columns are specified in the transformers list. Columns of the original feature matrix that are not specified are dropped from the resulting transformed feature matrix, unless specified in the passthrough keyword. Those columns specified with passthrough are added at the right to the output of the transformers.
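The reordering described in the quote can be seen on a small example. This is only a sketch: the frame and its column names (a, b, c) are made up for illustration and are not the asker's data.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy frame: 'a' and 'c' are scaled, 'b' is passed through.
df = pd.DataFrame({
    "a": [10.0, 20.0, 30.0],
    "b": [0.0, 1.0, 3.0],
    "c": [100.0, 300.0, 500.0],
})
columns_to_normalize = ["a", "c"]

preprocessor = ColumnTransformer(
    transformers=[("mm", MinMaxScaler(), columns_to_normalize)],
    remainder="passthrough",
)
values = preprocessor.fit_transform(df)

# Transformed columns come first, passthrough columns are appended on the right.
output_order = columns_to_normalize + [
    col for col in df.columns if col not in columns_to_normalize
]

print(list(df.columns))  # original order of the input frame
print(output_order)      # order of the columns in `values`
print(values)
```

Labeling `values` with `df.columns` here would attach the name c to the unscaled b data, which matches the symptom in the question: the last columns look as if they were never normalized.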

Here is a quick fix:

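from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from pandas import DataFrame

# create scaler
minmax_transformer = Pipeline(steps=[
        ('minmax', MinMaxScaler())])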

# perform normalization on the dataset, avoiding ordinal columns
preprocessor = ColumnTransformer(
        remainder='passthrough', 
        transformers=[
            ('mm', minmax_transformer , columns_to_normalize)
        ])

# fit and transform model on data
df_norm_values = preprocessor.fit_transform(df)

# convert the array back to a dataframe
df_norm = DataFrame(df_norm_values)

# set columns' names: scaled columns come first, passthrough columns follow
passthrough_cols = [col for col in df.columns
                    if col not in columns_to_normalize]

# build a new list so that columns_to_normalize is not modified in place
column_names = list(columns_to_normalize) + passthrough_cols

df_norm.columns = column_names
# summary of the normalized input variables
df_norm.describe()

Tags: python, machine-learning, data-mining, data-analysis, dataframe