Apply StandardScaler to parts of a data set
I want to use sklearn's StandardScaler. Is it possible to apply it to some feature columns but not others?
For instance, say my data is:
data = pd.DataFrame({'Name' : [3, 4,6], 'Age' : [18, 92,98], 'Weight' : [68, 59,49]})
Age Name Weight
0 18 3 68
1 92 4 59
2 98 6 49
col_names = ['Name', 'Age', 'Weight']
features = data[col_names]
I fit and transform the data:
scaler = StandardScaler().fit(features.values)
features = scaler.transform(features.values)
scaled_features = pd.DataFrame(features, columns = col_names)
Name Age Weight
0 -1.069045 -1.411004 1.202703
1 -0.267261 0.623041 0.042954
2 1.336306 0.787964 -1.245657
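Of course, the names are not really integers but strings, and I don't want to standardize them. How can I apply the fit and transform methods only to the Age and Weight columns?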
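Update:
Currently the best way to handle this is to use ColumnTransformer, as explained in the ColumnTransformer answer further down.
First, create a copy of your dataframe: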
scaled_features = data.copy()
Don't include the Name column in the transformation:
col_names = ['Age', 'Weight']
features = scaled_features[col_names]
scaler = StandardScaler().fit(features.values)
features = scaler.transform(features.values)
Now, don't create a new dataframe; instead, assign the result to those two columns:
scaled_features[col_names] = features
print(scaled_features)
Age Name Weight
0 -1.411004 3 1.202703
1 0.623041 4 0.042954
2 0.787964 6 -1.245657
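As a quick sanity check on the approach above, the sketch below repeats the same steps and confirms that only Age and Weight were standardized. It assumes the toy data from the question; note that StandardScaler divides by the population standard deviation (ddof=0), so that is what the check uses.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

data = pd.DataFrame({'Name': [3, 4, 6], 'Age': [18, 92, 98], 'Weight': [68, 59, 49]})
col_names = ['Age', 'Weight']

# Scale only the selected columns and write them back into a copy
scaled_features = data.copy()
scaled_features[col_names] = StandardScaler().fit_transform(data[col_names].values)

# Scaled columns should have mean 0 and population std (ddof=0) of 1
means = scaled_features[col_names].mean()
stds = scaled_features[col_names].std(ddof=0)

# The Name column should be exactly what it was before
name_unchanged = scaled_features['Name'].equals(data['Name'])

print(means.round(6).tolist(), stds.round(6).tolist(), name_unchanged)
```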
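A more pythonic way to do this:
from sklearn.preprocessing import StandardScaler
# Pass the two columns as one 2-D frame. DataFrame.apply would hand
# StandardScaler a 1-D Series per column, which it rejects.
data[['Age','Weight']] = StandardScaler().fit_transform(data[['Age','Weight']])
data
Output: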
Age Name Weight
0 -1.411004 3 1.202703
1 0.623041 4 0.042954
2 0.787964 6 -1.245657
Another option is to drop the Name column before scaling and then merge it back:
import pandas as pd
from sklearn.preprocessing import StandardScaler
data = pd.DataFrame({'Name' : [3, 4,6], 'Age' : [18, 92,98], 'Weight' : [68, 59,49]})
# Save the variable you don't want to scale
name_var = data['Name']
# Create the scaler and fit it to the remaining columns
scaler = StandardScaler()
scaler.fit(data.drop('Name', axis = 1))
# Calculate scaled values and store them in a separate object
scaled_values = scaler.transform(data.drop('Name', axis = 1))
# Rebuild the dataframe with the original index and the scaled columns' names
data = pd.DataFrame(scaled_values, index = data.index, columns = data.drop('Name', axis = 1).columns)
# Add the unscaled Name column back
data['Name'] = name_var
print(data)
Introduced in v0.20 is ColumnTransformer, which applies transformers to a specified set of columns of an array or pandas DataFrame.
import pandas as pd
data = pd.DataFrame({'Name' : [3, 4,6], 'Age' : [18, 92,98], 'Weight' : [68, 59,49]})
col_names = ['Name', 'Age', 'Weight']
features = data[col_names]
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
ct = ColumnTransformer([
('somename', StandardScaler(), ['Age', 'Weight'])
], remainder='passthrough')
ct.fit_transform(features)
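Note: Like Pipeline, it also has a shorthand version, make_column_transformer, which doesn't require naming the transformers.
Output (a NumPy array with the scaled Age and Weight columns first, followed by the passed-through Name column):
array([[-1.41100443,  1.20270298,  3.        ],
       [ 0.62304092,  0.04295368,  4.        ],
       [ 0.78796352, -1.24565666,  6.        ]])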
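For illustration, here is a minimal sketch of the make_column_transformer shorthand mentioned above, with the result wrapped back into a DataFrame. It assumes a reasonably recent scikit-learn, where each tuple is given as (transformer, columns); the output column order is written out by hand, relying on the transformed columns coming first and the passthrough columns last.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import StandardScaler

data = pd.DataFrame({'Name': [3, 4, 6], 'Age': [18, 92, 98], 'Weight': [68, 59, 49]})

# No transformer names needed; tuples are (transformer, columns)
ct = make_column_transformer(
    (StandardScaler(), ['Age', 'Weight']),
    remainder='passthrough',
)

# fit_transform returns a plain array: scaled columns first, then the rest
result = ct.fit_transform(data)

# Re-attach labels in the order the transformer emits the columns
scaled = pd.DataFrame(result, columns=['Age', 'Weight', 'Name'], index=data.index)
print(scaled)
```

Because the result is a single float array, the passed-through Name values come back as floats here; with real string names the array would have object dtype instead.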
The easiest way I found is:
from sklearn.preprocessing import StandardScaler
# temp is your DataFrame; the columns to scale are assumed to be stored as float64
# I'm selecting only numericals to scale
numerical = temp.select_dtypes(include='float64').columns
# This will transform the selected columns and write them back into the original data frame
temp.loc[:,numerical] = StandardScaler().fit_transform(temp.loc[:,numerical])
Output
Age Name Weight
0 -1.411004 3 1.202703
1 0.623041 4 0.042954
2 0.787964 6 -1.245657
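The dtype-based selection above fits the case described in the question, where the names are really strings. The following is a small sketch under that assumption: the name values are made up for illustration, and include='number' is used instead of 'float64' so that integer columns would be picked up as well.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical version of the data where Name really holds strings
data = pd.DataFrame({
    'Name': ['Ann', 'Bob', 'Cat'],
    'Age': [18.0, 92.0, 98.0],
    'Weight': [68.0, 59.0, 49.0],
})

# Pick up every numeric column; the string column is skipped automatically
numeric_cols = data.select_dtypes(include='number').columns

# Scale the numeric columns in one call and write them back
data[numeric_cols] = StandardScaler().fit_transform(data[numeric_cols])
print(data)
```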
Late to the party, but here's my preferred solution:
import pandas as pd
from sklearn.preprocessing import StandardScaler
#load data
data = pd.DataFrame({'Name' : [3, 4,6], 'Age' : [18, 92,98], 'Weight' : [68, 59,49]})
#list for cols to scale
cols_to_scale = ['Age','Weight']
#create and fit scaler
scaler = StandardScaler()
scaler.fit(data[cols_to_scale])
#scale selected data
data[cols_to_scale] = scaler.transform(data[cols_to_scale])
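One practical reason to keep fit and transform as separate steps, as in the snippet above, is that the fitted scaler can then be reused on rows it has not seen. The sketch below assumes a hypothetical new_data frame with the same columns; it is only an illustration, not part of the original answer.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

data = pd.DataFrame({'Name': [3, 4, 6], 'Age': [18, 92, 98], 'Weight': [68, 59, 49]})
cols_to_scale = ['Age', 'Weight']

# Fit once on the original rows
scaler = StandardScaler().fit(data[cols_to_scale])

# Hypothetical unseen rows with the same columns
new_data = pd.DataFrame({'Name': [7], 'Age': [50], 'Weight': [60]})

# Reuse the learned mean and scale instead of refitting on the new rows
new_scaled = new_data.copy()
new_scaled[cols_to_scale] = scaler.transform(new_data[cols_to_scale])

# The same result computed by hand from the fitted parameters
expected = (new_data[cols_to_scale].to_numpy() - scaler.mean_) / scaler.scale_
print(new_scaled)
```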