Standardize only numerical features with StandardScaler
I have the following dataset:
df=pd.read_csv('https://raw.githubusercontent.com/michalis0/DataMining_and_MachineLearning/master/data/HR_comma_sep.csv')
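I first encode salary with a label encoder le_salary, then with an ordinal encoder oe_salary. Then I encode department with a OneHotEncoder ohe_department. I concatenate everything and now have a concat_df.
Now I want to run a logistic regression, but with standardization, and this is where I run into problems.
Here are my values and the train/test split: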
X=concat_df[[ 'satisfaction_level', 'last_evaluation', 'number_project', 'average_monthly_hours', 'time_spent_company', 'work_accident', 'promotion_last_5years', ('IT',), ('RandD',), ('accounting',), ('hr',), ('management',), ('marketing',), ('product_mng',), ('sales',), ('support',), ('technical',), 'oe_salary', 'eval_spent']].values
y=concat_df["left"].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=72)
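Then I tried to standardize only the numerical values with the following code:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler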
scaler = StandardScaler()
#select cols to standardize
Cols = ['satisfaction_level', 'last_evaluation', 'number_project', 'average_monthly_hours', 'time_spent_company', 'eval_spent']
#set up preprocessor
preprocessor = ColumnTransformer([('standard', scaler, Cols)], remainder = 'passthrough')
#fit preprocessor
X_train_std = preprocessor.fit_transform(X_train)
X_test_std = preprocessor.transform(X_test)
But I got the following error, which I don't understand, since I have standardized before without any problem.
AttributeError Traceback (most recent call last)
/usr/local/lib/python3.7/dist-packages/sklearn/utils/__init__.py in _get_column_indices(X, key)
408 try:
--> 409 all_columns = X.columns
410 except AttributeError:
AttributeError: 'numpy.ndarray' object has no attribute 'columns'
During handling of the above exception, another exception occurred:
ValueError Traceback (most recent call last)
3 frames
/usr/local/lib/python3.7/dist-packages/sklearn/utils/__init__.py in _get_column_indices(X, key)
410 except AttributeError:
411 raise ValueError(
--> 412 "Specifying the columns using strings is only "
413 "supported for pandas DataFrames"
414 )
ValueError: Specifying the columns using strings is only supported for pandas DataFrames
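Why does this error occur, and what does it mean?
Calling .values converts the DataFrame into a NumPy array, which has no column names, so ColumnTransformer cannot resolve the strings in Cols. Remove .values so that X and y stay pandas objects, like this: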
X=concat_df[[ 'satisfaction_level', 'last_evaluation', 'number_project', 'average_monthly_hours', 'time_spent_company', 'work_accident', 'promotion_last_5years', ('IT',), ('RandD',), ('accounting',), ('hr',), ('management',), ('marketing',), ('product_mng',), ('sales',), ('support',), ('technical',), 'oe_salary', 'eval_spent']]
y=concat_df["left"]
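This keeps the DataFrame format, so the columns can be referred to by name.
In addition, to get rid of the warnings about the column names, we can rename the columns at the start like this: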
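To make the cause of the error concrete, here is a minimal sketch on made-up data. The `toy` DataFrame and its values are assumptions for illustration only and stand in for concat_df. It shows that the same ColumnTransformer accepts a DataFrame but rejects the array returned by .values.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data standing in for concat_df (values are invented).
toy = pd.DataFrame({
    "satisfaction_level": [0.38, 0.80, 0.11, 0.72],
    "number_project": [2, 5, 7, 5],
    "work_accident": [0, 0, 1, 0],
})
num_cols = ["satisfaction_level", "number_project"]

preprocessor = ColumnTransformer(
    [("standard", StandardScaler(), num_cols)],
    remainder="passthrough",
)

# Works: a DataFrame carries column names, so the strings can be resolved.
scaled = preprocessor.fit_transform(toy)

# Fails: .values yields a NumPy array, which has no column names.
try:
    preprocessor.fit_transform(toy.values)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(scaled)
print(error_message)
```

The scaled columns come first in the output and the passed-through columns follow, so the column order of the result differs from the input. If a NumPy array really is needed, integer column positions can be given to ColumnTransformer in place of the names.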
concat_df.columns = ['satisfaction_level',
'last_evaluation',
'number_project',
'average_monthly_hours',
'time_spent_company',
'work_accident',
'promotion_last_5years',
'IT',
'RandD',
'accounting',
'hr',
'management',
'marketing',
'product_mng',
'sales',
'support',
'technical',
'oe_salary',
'eval_spent',
'left']
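As an alternative to typing the full list by hand, the tuple-shaped names could be flattened programmatically. This is only a sketch: `demo` is a hypothetical one-row frame, and it assumes the tuple names are all one-element tuples such as ('IT',), as in the question.

```python
import pandas as pd

# Hypothetical frame with the same mix of string and tuple column names.
demo = pd.DataFrame(
    [[0.38, 1, 0, 1]],
    columns=["satisfaction_level", ("IT",), ("RandD",), "left"],
)

# Keep plain strings as they are; unwrap one-element tuples to their string.
demo.columns = [c[0] if isinstance(c, tuple) else c for c in demo.columns]

print(list(demo.columns))
```

This avoids having to keep a hard-coded list in sync with the order of the columns in concat_df.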
Then we can select the columns by their new names:
X=concat_df[['satisfaction_level',
'last_evaluation',
'number_project',
'average_monthly_hours',
'time_spent_company',
'work_accident',
'promotion_last_5years',
'IT',
'RandD',
'accounting',
'hr',
'management',
'marketing',
'product_mng',
'sales',
'support',
'technical',
'oe_salary',
'eval_spent']]
y=concat_df["left"]
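Putting the pieces together, the whole flow might look like the sketch below. The data here are randomly generated and only borrow a few column names from the question, so the accuracy is meaningless. Wrapping the steps in a Pipeline is one possible design choice, not something the original code does.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Random stand-in data (an assumption; not the real HR dataset).
rng = np.random.default_rng(72)
n = 200
demo_df = pd.DataFrame({
    "satisfaction_level": rng.uniform(0, 1, n),
    "average_monthly_hours": rng.integers(96, 310, n),
    "work_accident": rng.integers(0, 2, n),
    "left": rng.integers(0, 2, n),
})

# No .values here: X stays a DataFrame and y stays a Series.
X = demo_df.drop(columns="left")
y = demo_df["left"]

# train_test_split keeps pandas objects as pandas objects.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=72
)

num_cols = ["satisfaction_level", "average_monthly_hours"]
preprocessor = ColumnTransformer(
    [("standard", StandardScaler(), num_cols)],
    remainder="passthrough",
)

# The scaler is fitted on the training split only when the pipeline is fitted.
model = Pipeline([
    ("prep", preprocessor),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
print(accuracy)
```

Because the preprocessor lives inside the pipeline, model.score and model.predict apply the already fitted scaling to the test data without a separate transform call.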