I'm having problems with one-hot encoding

Question

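I'm using logistic regression on a football dataset, but when I one-hot encode the home team and away team names, the model ends up with 100% accuracy. Even after doing a train_test_split I still get 100. What am I doing wrong?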

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection  import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.preprocessing import OneHotEncoder
import pandas as pd
import numpy as np
df = pd.read_csv("FIN.csv")
df['Date'] = pd.to_datetime(df["Date"])
df = df[(df["Date"] > '2020/04/01')]
df['BTTS'] = np.where((df.HG > 0) & (df.AG > 0), 1, 0)
#print(df.to_string())
df.dropna(inplace=True)
x = df[['Home', 'Away', 'Res', 'HG', 'AG', 'PH', 'PD', 'PA', 'MaxH', 'MaxD', 'MaxA', 'AvgH', 'AvgD', 'AvgA']].values
y = df['BTTS'].values

np.set_printoptions(threshold=np.inf)
model = LogisticRegression()
ohe = OneHotEncoder(categories=[df.Home, df.Away, df.Res], sparse=False)
x = ohe.fit_transform(x)
print(x)
model.fit(x, y)
print(model.score(x, y))
x_train, x_test, y_train, y_test = train_test_split(x, y, shuffle=False)
model.fit(x_train, y_train)
print(model.score(x_test, y_test))
y_pred = model.predict(x_test)
print("accuracy:", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))
print("recall:", recall_score(y_test, y_pred))
print("f1 score:", f1_score(y_test, y_pred))

Answer 1

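Overfitting is when your training accuracy is high while your test accuracy is low. The model "overfits" because it has essentially memorized the training outcomes and doesn't generalize well to new, unseen data.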
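As a rough illustration of that train/test gap, here is a minimal sketch that uses a hand-rolled 1-nearest-neighbour classifier (not logistic regression, and not your data) on made-up features with random labels. There is nothing to learn, so any training accuracy above chance is pure memorization:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))        # made-up features
y = rng.integers(0, 2, size=400)     # random labels, unrelated to X

# Ordered split, like train_test_split(..., shuffle=False)
X_train, X_test = X[:300], X[300:]
y_train, y_test = y[:300], y[300:]

def predict_1nn(X_query):
    """Return the label of the closest training row for each query row."""
    dists = ((X_query[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    return y_train[dists.argmin(axis=1)]

train_acc = (predict_1nn(X_train) == y_train).mean()
test_acc = (predict_1nn(X_test) == y_test).mean()
print("train accuracy:", train_acc)  # every training row is its own nearest neighbour
print("test accuracy:", test_acc)    # should hover around chance
```

A perfect training score next to a near-chance test score is the overfitting pattern. Your case is different: the test score is perfect too, which points at leakage rather than overfitting.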

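The reason you're getting 100% accuracy is exactly what I said in the comments: there is (for lack of a better term) data leakage. You're essentially allowing your model to "cheat". Your target variable y (i.e., 'BTTS') is a feature engineered from the data. It's derived from 'HG' and 'AG', so those two columns are perfectly (100%) associated with your target. You define 'BTTS' as 1 when both 'HG' and 'AG' are greater than 0, and then you include those two columns in the training data. So the model simply picks up on that obvious association (i.e., home goals are 1 or more and away goals are 1 or more -> both teams scored).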

As soon as the model sees that both of those values are greater than 0, it predicts 1; if either value is 0, it predicts 0.

Remove 'HG' and 'AG' from x (the features).
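To make the effect concrete, here is a small sketch on synthetic data (not FIN.csv). The goal counts and the `noise` column are made up, and `holdout_accuracy` is a hypothetical helper; only the 'BTTS' rule is taken from your code. It one-hot encodes every column, as your script does, and compares the held-out score with and without the goal columns:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)
n = 600
hg = rng.poisson(1.4, size=n)             # made-up home goals
ag = rng.poisson(1.1, size=n)             # made-up away goals
noise = rng.integers(0, 5, size=n)        # made-up feature unrelated to the result
btts = ((hg > 0) & (ag > 0)).astype(int)  # same rule as in the question

def holdout_accuracy(columns):
    """One-hot encode the given columns and score a logistic regression on held-out rows."""
    x = np.column_stack(columns)
    x_train, x_test, y_train, y_test = train_test_split(x, btts, shuffle=False)
    ohe = OneHotEncoder(handle_unknown="ignore")  # tolerate values unseen in training
    model = LogisticRegression(max_iter=1000)
    model.fit(ohe.fit_transform(x_train), y_train)
    return model.score(ohe.transform(x_test), y_test)

leaky_acc = holdout_accuracy([hg, ag, noise])  # the target's inputs are among the features
clean_acc = holdout_accuracy([noise])          # the target's inputs are removed
print("with HG/AG:   ", leaky_acc)
print("without HG/AG:", clean_acc)
```

With the goal columns present the model only has to learn "neither count is 0", so the held-out score should come out at or very near 1.0. Without them it has nothing informative left and should drop to roughly the base rate.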

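With those two columns removed from your data, you'll see a more realistic performance (poor, though: only slightly better than a coin flip):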

1.0
0.5625
accuracy: 0.5625
precision: 0.6666666666666666
recall: 0.4444444444444444
f1 score: 0.5333333333333333
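As a quick sanity check on those numbers: the F1 score is the harmonic mean of precision and recall, so it can be recomputed from the two values printed above. This is just arithmetic on the reported output, not part of the original script:

```python
# Values copied from the output above
precision = 0.6666666666666666
recall = 0.4444444444444444

# F1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 4))  # 0.5333
```

The low recall is what drags F1 down here: the model misses more than half of the matches where both teams actually scored.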

With a confusion matrix:

from sklearn.metrics import confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

# Class labels in sorted order (0 and 1 for BTTS)
labels = np.unique(y).tolist()
cf_matrix = confusion_matrix(y_test, y_pred, labels=labels)

# Rows are the actual classes, columns are the predicted classes
ax = sns.heatmap(cf_matrix, annot=True, cmap='Blues')
ax.set_title('Confusion Matrix\n')
ax.set_xlabel('\nPredicted Values')
ax.set_ylabel('Actual Values')
ax.xaxis.set_ticklabels(labels)
ax.yaxis.set_ticklabels(labels)

plt.show()

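Another option is to compute a 'Total_Goals' field and see whether the model can predict from that. Obviously, it still helps in the obvious cases (if 'Total_Goals' is 0 or 1, then 'BTTS' will be 0). But if 'Total_Goals' is 2 or more, one of the teams may still have been shut out, so the model has to rely on the other features to try to work it out.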

Here's that example:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection  import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.preprocessing import OneHotEncoder
import pandas as pd
import numpy as np


df = pd.read_csv("FIN.csv")
df['Date'] = pd.to_datetime(df["Date"])
df = df[(df["Date"] > '2020/04/01')]
df['BTTS'] = np.where((df.HG > 0) & (df.AG > 0), 1, 0)
#print(df.to_string())
df.dropna(inplace=True)
df['Total_Goals'] = df['HG'] + df['AG']

x = df[['Home', 'Away', 'Res', 'Total_Goals', 'PH', 'PD', 'PA', 'MaxH', 'MaxD', 'MaxA', 'AvgH', 'AvgD', 'AvgA']].values
y = df['BTTS'].values

np.set_printoptions(threshold=np.inf)
model = LogisticRegression()

# scikit-learn >= 1.2 calls this argument sparse_output; older versions use sparse=False
ohe = OneHotEncoder(sparse_output=False)
x = ohe.fit_transform(x)
#print(x)
model.fit(x, y)
print(model.score(x, y))
x_train, x_test, y_train, y_test = train_test_split(x, y, shuffle=False)
model.fit(x_train, y_train)
print(model.score(x_test, y_test))
y_pred = model.predict(x_test)
print("accuracy:", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))
print("recall:", recall_score(y_test, y_pred))
print("f1 score:", f1_score(y_test, y_pred))

from sklearn.metrics import confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

# Class labels in sorted order (0 and 1 for BTTS)
labels = np.unique(y).tolist()
cf_matrix = confusion_matrix(y_test, y_pred, labels=labels)

# Rows are the actual classes, columns are the predicted classes
ax = sns.heatmap(cf_matrix, annot=True, cmap='Blues')
ax.set_title('Confusion Matrix\n')
ax.set_xlabel('\nPredicted Values')
ax.set_ylabel('Actual Values')
ax.xaxis.set_ticklabels(labels)
ax.yaxis.set_ticklabels(labels)

plt.show()

Output:

1.0
0.8
accuracy: 0.8
precision: 0.8536585365853658
recall: 0.7777777777777778
f1 score: 0.8139534883720929
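To see why 'Total_Goals' only partly gives the answer away, this small sketch enumerates scorelines and lists which 'BTTS' values are possible for each total. It uses no real data, only the 'BTTS' rule from the code above; the `possible` dictionary is just an illustrative name:

```python
# For each total, collect the BTTS values that some scoreline with that total can produce
possible = {}
for total in range(0, 6):
    seen = set()
    for hg in range(0, total + 1):
        ag = total - hg
        seen.add(int(hg > 0 and ag > 0))  # same BTTS rule as above
    possible[total] = sorted(seen)

for total, values in possible.items():
    print(total, values)
```

Totals of 0 and 1 force 'BTTS' to 0, while every total from 2 upward is compatible with both outcomes (2-0 versus 1-1, for example). Also keep in mind that the total is only known after the match, so a model trained on it can't be used for pre-match prediction.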

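To predict on new data, you need the new data in the same form as the training data. You then also need to apply whatever transformations were fit on the training data to the new data. Note that the new row below has no 'Res' or 'Total_Goals' (they aren't known before the match), so the encoder and model used here must have been fit on just these 11 columns: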

new_data = pd.DataFrame(
        data = [['Haka', 'Mariehamn', 3.05,     3.66,   2.35,   3.05,   3.66,   2.52,   2.88,   3.48,   2.32]],
        columns = ['Home', 'Away', 'PH', 'PD', 'PA', 'MaxH', 'MaxD', 'MaxA', 'AvgH', 'AvgD', 'AvgA']
        )

# ohe and model must have been fit on exactly these columns, in this order
to_predict = new_data[['Home', 'Away', 'PH', 'PD', 'PA', 'MaxH', 'MaxD', 'MaxA', 'AvgH', 'AvgD', 'AvgA']]

to_predict_encoded = ohe.transform(to_predict)
prediction = model.predict(to_predict_encoded)
prediction_prob = model.predict_proba(to_predict_encoded)

# Report the probability of the predicted class rather than always class 0
print(f'Predict: {prediction[0]} with {prediction_prob[0].max()} probability.')

Output:

Predict: 0 with 0.8204957018099501 probability.
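One caveat with one-hot encoding every column: each distinct odds value becomes its own category, and by default the encoder raises an error on a value it didn't see during fitting. A possible alternative, sketched here on made-up data, is to encode only the team names and pass the odds through as numbers. The training rows, the labels, and the names 'TeamC' and 'TeamD' are placeholders, so the printed prediction is meaningless; only the structure matters:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(1)
teams = ["Haka", "Mariehamn", "TeamC", "TeamD"]  # last two are placeholder names
n = 200
features = ["Home", "Away", "AvgH", "AvgD", "AvgA"]
train = pd.DataFrame({
    "Home": rng.choice(teams, size=n),
    "Away": rng.choice(teams, size=n),
    "AvgH": rng.uniform(1.5, 4.0, size=n),  # made-up odds
    "AvgD": rng.uniform(3.0, 4.0, size=n),
    "AvgA": rng.uniform(1.5, 4.0, size=n),
})
y = rng.integers(0, 2, size=n)  # made-up BTTS labels

# One-hot encode the team names only; the odds columns stay numeric
preprocess = ColumnTransformer(
    [("teams", OneHotEncoder(handle_unknown="ignore"), ["Home", "Away"])],
    remainder="passthrough",
)
pipeline = make_pipeline(preprocess, LogisticRegression(max_iter=1000))
pipeline.fit(train[features], y)

# The new row has the same columns as the training features
new_match = pd.DataFrame([["Haka", "Mariehamn", 2.88, 3.48, 2.32]], columns=features)
pred = pipeline.predict(new_match)[0]
proba = pipeline.predict_proba(new_match)[0]
print(f"Predict: {pred} with {proba.max():.3f} probability.")
```

Wrapping the encoder and the model in one pipeline means the transformations fit on the training data are applied to new rows automatically, and `handle_unknown="ignore"` keeps a team that wasn't in the training data from raising an error.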
