Pandas dataset features in wrong order for logistic regression
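The features/variables in my training and test datasets were originally in the same order with matching names, but after I used the .get_dummies() method to convert my categorical variables to binary variables before running logistic regression, the column order no longer matches. The categorical variable causing the problem is the 'Dependents' feature, which takes the values '0', '1', '2', or '3'. The get_dummies() method created four different features ('Dependents_0', 'Dependents_1', 'Dependents_2', and 'Dependents_3').
In the training dataset, for some reason, they are ordered like this: 'Dependents_3', 'Dependents_0', 'Dependents_1', 'Dependents_2'
The test dataset is in the correct order. I think this is what causes the problem when I try to run the model on the test dataset, because I get this warning: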
/usr/local/lib/python3.7/dist-packages/sklearn/base.py:493: FutureWarning: The feature names should match those that were passed during fit. Starting version 1.2, an error will be raised.
Feature names must be in the same order as they were in fit.
warnings.warn(message, FutureWarning)
Additional information about the datasets after calling get_dummies():
=> train_ds.dtypes
ApplicantIncome int64
CoapplicantIncome float64
LoanAmount float64
Loan_Amount_Term float64
Credit_History float64
Loan_Status int64
Gender_Female uint8
Gender_Male uint8
Married_No uint8
Married_Yes uint8
Dependents_3 uint8
Dependents_0 uint8
Dependents_1 uint8
Dependents_2 uint8
Education_Graduate uint8
Education_Not Graduate uint8
Self_Employed_No uint8
Self_Employed_Yes uint8
Property_Area_Rural uint8
Property_Area_Semiurban uint8
Property_Area_Urban uint8
dtype: object
=> test_ds.dtypes
ApplicantIncome int64
CoapplicantIncome int64
LoanAmount float64
Loan_Amount_Term float64
Credit_History float64
Gender_Female uint8
Gender_Male uint8
Married_No uint8
Married_Yes uint8
Dependents_0 uint8
Dependents_1 uint8
Dependents_2 uint8
Dependents_3 uint8
Education_Graduate uint8
Education_Not Graduate uint8
Self_Employed_No uint8
Self_Employed_Yes uint8
Property_Area_Rural uint8
Property_Area_Semiurban uint8
Property_Area_Urban uint8
dtype: object
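You can use the columns attribute of the training DataFrame to reorder the columns of the test DataFrame. Note that train_ds also contains the target column Loan_Status, which test_ds does not have, so drop it from the column list first:
test_ds = test_ds[train_ds.columns.drop('Loan_Status')]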
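To illustrate the reordering, here is a minimal sketch that uses small made-up frames in place of the real loan data. The values are hypothetical and only a few of the columns from the question are reproduced. It assumes the target column is Loan_Status and that it exists only in train_ds, as the dtypes listings suggest. It uses DataFrame.reindex rather than plain indexing, so a dummy column missing from the test set would be added as zeros instead of raising a KeyError.

```python
import pandas as pd

# Hypothetical toy frames that mimic the column layout from the question.
train_ds = pd.DataFrame({
    "ApplicantIncome": [5000, 3000, 4000],
    "Loan_Status": [1, 0, 1],
    "Dependents_3": [0, 1, 0],
    "Dependents_0": [1, 0, 0],
    "Dependents_1": [0, 0, 1],
    "Dependents_2": [0, 0, 0],
})
test_ds = pd.DataFrame({
    "ApplicantIncome": [4500, 2500],
    "Dependents_0": [1, 0],
    "Dependents_1": [0, 1],
    "Dependents_2": [0, 0],
    "Dependents_3": [0, 0],
})

# Feature columns in the order the model will see them during fit.
feature_cols = train_ds.columns.drop("Loan_Status")
X_train = train_ds[feature_cols]
y_train = train_ds["Loan_Status"]

# Put the test columns in the training order; fill_value=0 creates
# any dummy column that is absent from the test frame.
X_test = test_ds.reindex(columns=feature_cols, fill_value=0)

print(list(X_test.columns))
```

With the columns aligned this way, fitting on X_train and predicting on X_test should present the estimator with the same feature names in the same order, which is the condition the FutureWarning describes.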