Pandas dataset features in wrong order for logistic regression

Question

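The features/variables of my training and test datasets were originally in the same order with matching names, but after I used the .get_dummies() method to convert my categorical variables to binary variables and ran a logistic regression, I ran into an ordering problem. The categorical variable causing the problem is the 'Dependents' feature, which takes the values "0", "1", "2", or "3". The get_dummies() method created 4 different features ('Dependents_0', 'Dependents_1', 'Dependents_2', and 'Dependents_3').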

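In the train dataset, for some reason, they are ordered like this: 'Dependents_3', 'Dependents_0', 'Dependents_1', 'Dependents_2'

The test dataset is in the correct order. I therefore believe this causes a problem when I run the model on the test dataset, because I get this warning: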

/usr/local/lib/python3.7/dist-packages/sklearn/base.py:493: FutureWarning: The feature names should match those that were passed during fit. Starting version 1.2, an error will be raised.
Feature names must be in the same order as they were in fit.

  warnings.warn(message, FutureWarning)

Additional information about the datasets after calling the get_dummies() method:

=> train_ds.dtypes
ApplicantIncome              int64
CoapplicantIncome          float64
LoanAmount                 float64
Loan_Amount_Term           float64
Credit_History             float64
Loan_Status                  int64
Gender_Female                uint8
Gender_Male                  uint8
Married_No                   uint8
Married_Yes                  uint8
Dependents_3                 uint8
Dependents_0                 uint8
Dependents_1                 uint8
Dependents_2                 uint8
Education_Graduate           uint8
Education_Not Graduate       uint8
Self_Employed_No             uint8
Self_Employed_Yes            uint8
Property_Area_Rural          uint8
Property_Area_Semiurban      uint8
Property_Area_Urban          uint8
dtype: object


=> test_ds.dtypes
ApplicantIncome              int64
CoapplicantIncome            int64
LoanAmount                 float64
Loan_Amount_Term           float64
Credit_History             float64
Gender_Female                uint8
Gender_Male                  uint8
Married_No                   uint8
Married_Yes                  uint8
Dependents_0                 uint8
Dependents_1                 uint8
Dependents_2                 uint8
Dependents_3                 uint8
Education_Graduate           uint8
Education_Not Graduate       uint8
Self_Employed_No             uint8
Self_Employed_Yes            uint8
Property_Area_Rural          uint8
Property_Area_Semiurban      uint8
Property_Area_Urban          uint8
dtype: object

Answer 1

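You can reorder the test DataFrame's columns using the columns attribute of the training DataFrame. Drop the target column 'Loan_Status' first, since it exists only in the training data, and assign the result back:

test_ds = test_ds[train_ds.columns.drop('Loan_Status')]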
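
If a dummy column could also be missing from one of the datasets (for example, a category that never appears in the test data), plain column selection would raise a KeyError. A minimal sketch of a more defensive variant uses DataFrame.reindex, which reorders the columns and creates any missing ones. The tiny frames below are made-up stand-ins for train_ds and test_ds, and the names feature_cols and test_aligned are only illustrative:

```python
import pandas as pd

# Hypothetical miniature frames standing in for train_ds / test_ds.
train_ds = pd.DataFrame({
    "ApplicantIncome": [5000, 3000],
    "Loan_Status": [1, 0],      # target, present only in the training data
    "Dependents_3": [0, 1],     # note: out of order, as in the question
    "Dependents_0": [1, 0],
    "Dependents_1": [0, 0],
})
test_ds = pd.DataFrame({
    "ApplicantIncome": [4000],
    "Dependents_0": [1],
    "Dependents_3": [0],        # 'Dependents_1' is absent here on purpose
})

# Feature columns in the order used for fitting (target excluded).
feature_cols = train_ds.columns.drop("Loan_Status")

# reindex puts the test columns in the training order and adds any
# column missing from the test set, filled with 0.
test_aligned = test_ds.reindex(columns=feature_cols, fill_value=0)

print(list(test_aligned.columns))
```

Whether filling a missing dummy column with 0 is appropriate depends on your data; it assumes the category simply did not occur in the test set.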

python

pandas

logistic-regression