python pandas standardize column for regression

Question

I have the following df:

Date       Event_Counts   Category_A  Category_B
20170401      982457          0           1
20170402      982754          1           0
20170402      875786          0           1

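I am preparing data for a regression analysis and want to standardize the Event_Counts column so that it is on a scale similar to the category columns.

I am using the following code: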

from sklearn import preprocessing
df['scaled_event_counts'] = preprocessing.scale(df['Event_Counts'])

I do get this warning:

DataConversionWarning: Data with input dtype int64 was converted to float64 by the scale function.
  warnings.warn(msg, _DataConversionWarning)

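Still, it seems to have worked; there is a new column. However, it contains negative numbers such as -1.3.

I thought the scale function subtracts the mean from each value and divides by the standard deviation, and then adds the minimum of the result to each row.

Does it not work that way with pandas? Or should I use the normalize() function or StandardScaler() instead? I want the standardized column to be on a scale of 0 to 1.

Thanks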

Answer 1

Scaling is done by subtracting the mean of each feature (column) and dividing by its standard deviation. So,

scaled_event_counts = (Event_Counts - mean(Event_Counts)) / std(Event_Counts)

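The int64 to float64 warning comes from having to subtract the mean, which is a floating-point number rather than an integer.

You will get negative numbers in the scaled column because the mean is shifted to zero, so every value below the original mean becomes negative. Nothing is added back afterwards.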
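To make the formula above concrete, here is a minimal sketch that redoes the computation by hand on the three rows from the question and compares it with preprocessing.scale. It assumes the standard deviation is the population one (ddof=0, the NumPy default); the DataFrame is rebuilt here only so the snippet runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn import preprocessing

# Rebuild the sample data from the question.
df = pd.DataFrame({
    "Event_Counts": [982457, 982754, 875786],
    "Category_A": [0, 1, 0],
    "Category_B": [1, 0, 1],
})

# Cast to float up front so the int64 -> float64 conversion is explicit.
counts = df["Event_Counts"].astype("float64")

# Manual standardization: subtract the mean, divide by the population std.
manual = (counts - counts.mean()) / counts.std(ddof=0)

# The same thing via scikit-learn.
scaled = preprocessing.scale(counts.to_numpy())

print(np.allclose(manual, scaled))  # the two results should agree
print(scaled)                       # centered on zero, so some values are negative
```

The third row is below the column mean, so its scaled value is negative. That is expected for standardization and is not a pandas issue.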

Answer 2

I think you are looking for sklearn.preprocessing.MinMaxScaler. It lets you scale to a given range.

So in your case it would be:

scaler = preprocessing.MinMaxScaler(feature_range=(0, 1))
# fit_transform expects 2D input, so select the column with double brackets
df['scaled_event_counts'] = scaler.fit_transform(df[['Event_Counts']]).ravel()
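One thing to watch for: recent scikit-learn versions reject a 1D Series in fit_transform and ask for 2D input. The sketch below shows two ways to pass a single column that should both work, assuming a reasonably current pandas and scikit-learn; the small DataFrame is a stand-in for the one in the question.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Stand-in for the DataFrame in the question.
df = pd.DataFrame({"Event_Counts": [982457, 982754, 875786]})
scaler = MinMaxScaler(feature_range=(0, 1))

# Option 1: double brackets return a one-column DataFrame (2D).
as_frame = scaler.fit_transform(df[["Event_Counts"]])

# Option 2: reshape the underlying array into a single column.
as_array = scaler.fit_transform(df["Event_Counts"].to_numpy().reshape(-1, 1))

# The result has shape (n_rows, 1); flatten it before storing it as a column.
df["scaled_event_counts"] = as_frame.ravel()
print(df)
```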

To scale the entire df:

scaled_df = scaler.fit_transform(df)
print(scaled_df)
[[ 0.          0.99722347  0.          1.        ]
 [ 1.          1.          1.          0.        ]
 [ 1.          0.          0.          1.        ]]
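For reference, with feature_range=(0, 1) the scaler computes (x - min) / (max - min) per column. Here is a minimal sketch that checks this by hand for Event_Counts, again rebuilding the sample data so it runs standalone.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Rebuild the sample data from the question.
df = pd.DataFrame({
    "Date": [20170401, 20170402, 20170402],
    "Event_Counts": [982457, 982754, 875786],
    "Category_A": [0, 1, 0],
    "Category_B": [1, 0, 1],
})

# Manual min-max scaling of one column.
col = df["Event_Counts"]
manual = (col - col.min()) / (col.max() - col.min())

# The same column through MinMaxScaler.
scaler = MinMaxScaler(feature_range=(0, 1))
scaled = scaler.fit_transform(df[["Event_Counts"]]).ravel()

print(np.allclose(manual, scaled))  # the two results should agree
print(manual.round(8).tolist())
```

Note that fit_transform returns a plain NumPy array, so column names are lost when you scale the whole df. If you need them, wrapping the result in pd.DataFrame(scaled_df, columns=df.columns, index=df.index) is one option.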
