Standardize some columns in a Python pandas DataFrame?
The Python code below only returns an array, but I want the scaled data to replace the original data.
from sklearn.preprocessing import StandardScaler
df = StandardScaler().fit_transform(df[['cost', 'sales']])
df
Output:
array([[ 1.99987622, -0.55900276],
[-0.49786658, -0.45658181],
[-0.5146864 , -0.505097 ],
[-0.48104676, -0.47814412],
[-0.50627649, 1.9988257 ]])
Original data:
id cost sales item
1 300 50 pen
2 3 88 bottle
3 1 70 drink
4 5 80 cup
5 2 999 ink
Just assign the result back to the same columns:
df[['cost', 'sales']] = StandardScaler().fit_transform(df[['cost', 'sales']])
df
Out[45]:
id cost sales item
0 1 1.999876 -0.559003 pen
1 2 -0.497867 -0.456582 bottle
2 3 -0.514686 -0.505097 drink
3 4 -0.481047 -0.478144 cup
4 5 -0.506276 1.998826 ink
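To see what the assignment is doing, here is a small sketch that compares the StandardScaler result with the z-score computed by hand in pandas. It assumes the question's data; the variable names cols and expected are mine. Note that StandardScaler divides by the population standard deviation (ddof=0), which is why std(ddof=0) is used here.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# The question's data, rebuilt so the snippet runs on its own
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "cost": [300, 3, 1, 5, 2],
    "sales": [50, 88, 70, 80, 999],
    "item": ["pen", "bottle", "drink", "cup", "ink"],
})
cols = ["cost", "sales"]

# Hand-computed z-score: subtract the column mean, divide by the ddof=0 std
expected = (df[cols] - df[cols].mean()) / df[cols].std(ddof=0)

# Assigning the returned array back keeps "id" and "item" untouched
df[cols] = StandardScaler().fit_transform(df[cols])

# Largest absolute difference between the two results (should be ~0)
max_diff = (df[cols] - expected).abs().to_numpy().max()
print(max_diff)
```

If you only need this once and do not need to reuse the fitted parameters, the pandas-only expression is enough on its own.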
Or, if you want to select the columns by position rather than by name:
import pandas as pd
from sklearn.preprocessing import StandardScaler
df = pd.DataFrame({"cost": [300,3,1,5,2], "sales": [50,88,70,80,999], "item": ["pen","bottle","drink","cup","ink"]})
# Scale selected columns by index
df.iloc[:, 0:2] = StandardScaler().fit_transform(df.iloc[:, 0:2])
cost sales item
0 1.999876 -0.559003 pen
1 -0.497867 -0.456582 bottle
2 -0.514686 -0.505097 drink
3 -0.481047 -0.478144 cup
4 -0.506276 1.998826 ink
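One caveat, stated as an assumption about your environment: depending on the pandas version, writing float results into integer columns through iloc may emit a dtype warning or fail, because iloc tries to set values in place. A minimal sketch that sidesteps this by casting the target columns to float first:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "cost": [300, 3, 1, 5, 2],
    "sales": [50, 88, 70, 80, 999],
    "item": ["pen", "bottle", "drink", "cup", "ink"],
})

# Cast the numeric columns to float so the in-place write keeps the dtype
df = df.astype({"cost": "float64", "sales": "float64"})

# Scale the first two columns by position
df.iloc[:, 0:2] = StandardScaler().fit_transform(df.iloc[:, 0:2])
print(df)
```

Assigning by column name (df[['cost', 'sales']] = ...) replaces the columns instead of writing into them, so it does not need the cast.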
You can also keep the scaler object, so that "new data" is scaled with the parameters fitted on the existing data:
df = pd.DataFrame({"cost": [300,3,1,5,2], "sales": [50,88,70,80,999], "item": ["pen","bottle","drink","cup","ink"]})
df_new = pd.DataFrame({"cost": [299,5,12,64,2], "sales": [55,99,48,20,999], "item": ["pen","bottle","drink","cup","ink"]})
# Set up scaler
scaler = StandardScaler().fit(df.iloc[:, 0:2])
# Scale original data
df.iloc[:, 0:2] = scaler.transform(df.iloc[:, 0:2])
# Scale new data
df_new.iloc[:, 0:2] = scaler.transform(df_new.iloc[:, 0:2])
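A short sketch of what "scaling on the basis of the existing scaler" means in practice: the fitted scaler stores the training means and standard deviations in mean_ and scale_, and transform applies exactly those to the new data. Selecting by column name here (instead of iloc) is my choice, and the names manual and restored are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({"cost": [300, 3, 1, 5, 2], "sales": [50, 88, 70, 80, 999]})
df_new = pd.DataFrame({"cost": [299, 5, 12, 64, 2], "sales": [55, 99, 48, 20, 999]})
cols = ["cost", "sales"]

# Fit on the original data only
scaler = StandardScaler().fit(df[cols])
print(scaler.mean_)   # per-column means of df
print(scaler.scale_)  # per-column standard deviations (ddof=0) of df

# New data is scaled with the stored parameters, not its own statistics
new_scaled = scaler.transform(df_new[cols])
manual = (df_new[cols].to_numpy() - scaler.mean_) / scaler.scale_
matches = np.allclose(new_scaled, manual)

# inverse_transform maps scaled values back to the original units
restored = scaler.inverse_transform(new_scaled)
```

This is the behaviour you usually want for a train/test split: fit on the training set, then only transform the test set.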
If you want the benefits of an sklearn Pipeline (convenience/encapsulation, joint parameter selection, and leakage safety), you can use ColumnTransformer:
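from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
# Scale "cost" and "sales"; pass every other column through unchanged
preproc = ColumnTransformer(
    transformers=[
        ('scale', StandardScaler(), ["cost", "sales"]),
    ],
    remainder="passthrough",
)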
(There are several ways to specify which columns go to the scaler; check the docs.) Now you keep the benefit of a saved, fitted scaler object, but you also don't have to keep repeating the slicing:
# Fit on the original data, then reuse the fitted transformer on the new data
df = preproc.fit_transform(df)
df_new = preproc.transform(df_new)
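Be aware that, with default settings, ColumnTransformer returns a NumPy array rather than a DataFrame, and it places the transformed columns first, followed by the passthrough columns. If you want a DataFrame back, one possible sketch is to rebuild it by hand; the names scaled_cols, rest, and out are illustrative.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "cost": [300, 3, 1, 5, 2],
    "sales": [50, 88, 70, 80, 999],
    "item": ["pen", "bottle", "drink", "cup", "ink"],
})
scaled_cols = ["cost", "sales"]

preproc = ColumnTransformer(
    transformers=[("scale", StandardScaler(), scaled_cols)],
    remainder="passthrough",
)
arr = preproc.fit_transform(df)  # array: scaled columns, then the rest

# Rebuild the labels in the order the transformer emits the columns
rest = [c for c in df.columns if c not in scaled_cols]
out = pd.DataFrame(arr, columns=scaled_cols + rest, index=df.index)

# The mixed array has object dtype, so restore float for the scaled columns
out = out.astype({c: "float64" for c in scaled_cols})
print(out)
```

Newer scikit-learn releases can also emit DataFrames directly; check the docs for your version before relying on that.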
You can use scale to standardize specific columns:
from sklearn.preprocessing import scale
cols = ['cost', 'sales']
df[cols] = scale(df[cols])
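scale subtracts the mean of each column and divides by that column's standard deviation. Note that sklearn uses the population standard deviation (ddof=0) here, not the sample standard deviation (ddof=1) that pandas' std() uses by default.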
Example:
# Prep
import pandas as pd
import numpy as np
from sklearn.preprocessing import scale
# Sample data
df = pd.DataFrame({
'cost':[300, 3, 1, 5, 2],
'sales':[50, 88, 70, 80, 999],
'item': ['pen', 'bottle', 'drink', 'cup', 'ink']
})
# Standardize columns
cols = ['cost', 'sales']
df[cols] = scale(df[cols])
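For a one-off transformation, scale should give the same numbers as StandardScaler().fit_transform; the difference is that the function does not keep the fitted mean and standard deviation, so it cannot be reused on new data. A quick check, with illustrative variable names:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler, scale

df = pd.DataFrame({
    "cost": [300, 3, 1, 5, 2],
    "sales": [50, 88, 70, 80, 999],
    "item": ["pen", "bottle", "drink", "cup", "ink"],
})
cols = ["cost", "sales"]

via_function = scale(df[cols])                        # stateless helper
via_class = StandardScaler().fit_transform(df[cols])  # keeps mean_/scale_

# Both routes produce the same standardized array
same = np.allclose(via_function, via_class)
print(same)
```

So scale is fine for exploratory work, while a fitted StandardScaler is the better choice when the same scaling has to be applied to later data.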