Why does my logistic regression yield only one class?
This is my first machine learning project, using a fictional dataset from Kaggle with 1,470 records. 84% of the records are class "0" and 16% are class "1". I used 1,200 records for training and testing, and held out 270 as new data to see what would happen. I ended up with a training score of 87% and a testing score of 83%, but all 270 new records were classified as 0.
Could it be that the fictional data just doesn't contain enough of a pattern to teach the machine how to classify? Or am I doing something wrong?
I have read some other posts that seem to deal with similar problems, but I haven't found a relevant answer. Any help would be appreciated.
import pandas as pd

df=pd.read_csv('Resources/train_data.csv')
df_skinny =df.drop(['EducationField','EmployeeCount','EmployeeNumber','index',
'StandardHours',
'JobRole','MaritalStatus','DailyRate','MonthlyRate','HourlyRate','Over18','OverTime'],
axis=1).drop_duplicates()
df_skinny.rename(columns={"Attrition": "EmploymentStatus"}, inplace=True)
df_skinny['EmploymentStatus'] = df_skinny['EmploymentStatus'].replace(['Yes','No'],[1,0])
df_skinny['Gender']=df_skinny['Gender'].replace(['Female','Male'],[0, 1])
df_skinny['BusinessTravel'] = df_skinny['BusinessTravel'].replace(['Travel_Rarely','Travel_Frequently','Non-Travel'],[1,2,0])
df_skinny['Department']=df_skinny['Department'].replace(['Human Resources','Sales','R&D'],[0 ,1,2])
df_train=df_skinny[:1200]
df_new=df_skinny[1201:]
X =df_train.drop("EmploymentStatus", axis=1)
y = df_train["EmploymentStatus"]
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
from sklearn.preprocessing import StandardScaler
X_scaler = StandardScaler().fit(X_train)
X_train_scaled = X_scaler.transform(X_train)
X_test_scaled = X_scaler.transform(X_test)
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression()
classifier.fit(X_train_scaled, y_train)
print(f"Training Data Score: {classifier.score(X_train_scaled, y_train)}")
print(f"Testing Data Score: {classifier.score(X_test_scaled, y_test)}")
predictions = classifier.predict(X_test_scaled)
print(f"First 30 Predictions: {predictions[:30]}")
print(f"First 30 Actual Employment Status: {y_test[:30].tolist()}")
new_X = df_new.drop("EmploymentStatus", axis=1)
new_predictions=classifier.predict(new_X)
print(new_predictions)
ynew = classifier.predict_proba(new_X)
print(ynew)
OUTPUT:
Training Data Score: 0.8655555555555555
Testing Data Score: 0.8333333333333334
First 30 Predictions: [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0]
First 30 Actual Employment Status: [1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0]
[[1.00000000e+000 0.00000000e+000]
[1.00000000e+000 0.00000000e+000]
[1.00000000e+000 0.00000000e+000]
[1.00000000e+000 0.00000000e+000]
[1.00000000e+000 5.24119991e-298]
[1.00000000e+000 7.88999798e-158]
[1.00000000e+000 2.73485216e-286]
[1.00000000e+000 0.00000000e+000]
[1.00000000e+000 0.00000000e+000]
As you mentioned, 84% of your data is in class 0 and 16% in class 1. That is a heavily imbalanced dataset, and a model trained on it will be strongly biased toward the majority class. This is why most of your predictions come out as 0.
A good dataset has its records balanced across all classes, so you need to balance yours using random sampling techniques. There are two kinds of sampling: oversampling and undersampling.
I suggest you first apply a sampling technique to balance your data, as in the sketch below.
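A minimal sketch of random oversampling, assuming the imbalanced-learn package (imblearn) is installed; it reuses the X_train_scaled and y_train variables from your code, and where exactly it slots into your pipeline is my assumption:

from imblearn.over_sampling import RandomOverSampler
from sklearn.linear_model import LogisticRegression

# Randomly duplicate minority-class (1) rows until both classes
# are the same size; random_state makes the resampling reproducible.
ros = RandomOverSampler(random_state=1)
X_resampled, y_resampled = ros.fit_resample(X_train_scaled, y_train)

# Refit the same classifier on the balanced training set.
classifier = LogisticRegression()
classifier.fit(X_resampled, y_resampled)

RandomUnderSampler from imblearn.under_sampling works the same way, except it drops majority-class rows instead of duplicating minority-class ones. Either way, resample only the training split, never the test or new data.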
You can learn more from this article:
https://machinelearningmastery.com/random-oversampling-and-undersampling-for-imbalanced-classification/
And you can refer to this notebook:
https://www.kaggle.com/shweta2407/oversampling-vs-undersampling-techniques