Why does my logistic regression yield only one class?
This is my first machine learning project, using a fictional dataset from Kaggle with 1,470 records. 84% of the records are class "0" and 16% are class "1". I used 1,200 records for training and testing, and held out 270 as new data to see what would happen. I ended up with a training score of 87% and a testing score of 83%, but all 270 new records were classified as 0.
Could it be that the fictional data just doesn't contain enough of a pattern to teach the machine how to classify? Or am I doing something wrong?
I have read some other posts that seem to deal with similar problems, but I haven't found a relevant answer. Any help would be appreciated.
import pandas as pd

df=pd.read_csv('Resources/train_data.csv')
df_skinny =df.drop(['EducationField','EmployeeCount','EmployeeNumber','index',
'StandardHours',
'JobRole','MaritalStatus','DailyRate','MonthlyRate','HourlyRate','Over18','OverTime'],
axis=1).drop_duplicates()
df_skinny.rename(columns={"Attrition": "EmploymentStatus"}, inplace=True)
df_skinny['EmploymentStatus'] = df_skinny['EmploymentStatus'].replace(['Yes','No'],[1,0])
df_skinny['Gender']=df_skinny['Gender'].replace(['Female','Male'],[0, 1])
df_skinny['BusinessTravel'] = df_skinny['BusinessTravel'].replace(['Travel_Rarely','Travel_Frequently','Non-Travel'],[1,2,0])
df_skinny['Department']=df_skinny['Department'].replace(['Human Resources','Sales','R&D'],[0 ,1,2])
df_train=df_skinny[:1200]
df_new=df_skinny[1201:]
X =df_train.drop("EmploymentStatus", axis=1)
y = df_train["EmploymentStatus"]
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
from sklearn.preprocessing import StandardScaler
X_scaler = StandardScaler().fit(X_train)
X_train_scaled = X_scaler.transform(X_train)
X_test_scaled = X_scaler.transform(X_test)
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression()
classifier.fit(X_train_scaled, y_train)
print(f"Training Data Score: {classifier.score(X_train_scaled, y_train)}")
print(f"Testing Data Score: {classifier.score(X_test_scaled, y_test)}")
predictions = classifier.predict(X_test_scaled)
print(f"First 30 Predictions: {predictions[:30]}")
print(f"First 30 Actual Employment Status: {y_test[:30].tolist()}")
new_X = df_new.drop("EmploymentStatus", axis=1)
new_predictions=classifier.predict(new_X)
print(new_predictions)
ynew = classifier.predict_proba(new_X)
print(ynew)
OUTPUT:
Training Data Score: 0.8655555555555555
Testing Data Score: 0.8333333333333334
First 30 Predictions: [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0]
First 30 Actual Employment Status: [1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0]
[[1.00000000e+000 0.00000000e+000]
[1.00000000e+000 0.00000000e+000]
[1.00000000e+000 0.00000000e+000]
[1.00000000e+000 0.00000000e+000]
[1.00000000e+000 5.24119991e-298]
[1.00000000e+000 7.88999798e-158]
[1.00000000e+000 2.73485216e-286]
[1.00000000e+000 0.00000000e+000]
[1.00000000e+000 0.00000000e+000]
As you mentioned, 84% of your data is in class 0 and 16% in class 1. That is a heavily imbalanced dataset, and a model trained on it will be strongly biased toward the majority class. This is why most of your predictions come out as 0.
A good dataset has its records balanced across all classes, so you need to balance yours using random sampling techniques. There are two kinds of sampling: oversampling and undersampling.
I suggest you first apply a sampling technique to balance your data, as in the sketch below.
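A minimal sketch of random oversampling, assuming the imbalanced-learn package (imblearn) is installed; it reuses the X_train_scaled and y_train variables from your code, and where exactly it slots into your pipeline is my assumption:

from imblearn.over_sampling import RandomOverSampler
from sklearn.linear_model import LogisticRegression

# Randomly duplicate minority-class (1) rows until both classes
# are the same size; random_state makes the resampling reproducible.
ros = RandomOverSampler(random_state=1)
X_resampled, y_resampled = ros.fit_resample(X_train_scaled, y_train)

# Refit the same classifier on the balanced training set.
classifier = LogisticRegression()
classifier.fit(X_resampled, y_resampled)

RandomUnderSampler from imblearn.under_sampling works the same way, except it drops majority-class rows instead of duplicating minority-class ones. Either way, resample only the training split, never the test or new data.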
You can learn more from this article:
https://machinelearningmastery.com/random-oversampling-and-undersampling-for-imbalanced-classification/
And you can refer to this notebook:
https://www.kaggle.com/shweta2407/oversampling-vs-undersampling-techniques