Logistic Regression vs predicting probability by splitting data into bins

Question

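So I am exploring using a logistic regression model to predict the probability of a shot resulting in a goal. I have two predictors, but for simplicity let's assume I have one: distance from the goal. While doing some data exploration, I decided to investigate the relationship between distance and the shot outcome. I did this by splitting the data into equal-sized bins and then taking the mean of all the outcomes (0 for a miss, 1 for a goal) within each bin. I then plotted the average distance from the goal for each bin against the probability of scoring. I did this in Python: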

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# df is assumed to be an existing DataFrame with 'Distance' and 'Goal' columns
# use the seaborn library to inspect the distribution of the shots by result (goal or no goal)
fig, axes = plt.subplots(1, 2, figsize=(11, 5))
# first we want to create bins to calculate our probability
# pandas has a function qcut that evenly distributes the data
# into n bins based on a desired column value
df['Goal'] = df['Goal'].astype(int)
df['Distance_Bins'] = pd.qcut(df['Distance'], q=50)
# now we want to find the mean of the Goal column (our estimated probability) for each bin
# and the mean of the distance for each bin
binned = df.groupby('Distance_Bins', as_index=False, observed=True)
dist_prob = binned['Goal'].mean()['Goal']
dist_mean = binned['Distance'].mean()['Distance']
dist_trend = sns.scatterplot(x=dist_mean, y=dist_prob, ax=axes[0])
dist_trend.set(xlabel="Avg. Distance of Bin",
               ylabel="Probability of Goal",
               title="Probability of Scoring Based on Distance")

[Figure: Probability of Scoring Based on Distance]

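So my question is: why would we go through the process of creating a logistic regression model when I could fit a curve to the plot in the image? Wouldn't that provide a function that predicts the probability of scoring for a shot at distance x?

I guess the problem would be that we are reducing 40,000 data points to 50, but I'm not entirely sure why that would be a problem for predicting future shots. Could we increase the number of bins, or would that just increase the variability? Is this a case of the bias-variance tradeoff? I'm just a bit confused about why this wouldn't be as good as a logistic model.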

Answer 1

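The binning approach is more finicky than logistic regression, since you need to try different functional forms to fit the curve (e.g. an inverse relationship, log, square, etc.), whereas for logistic regression you only need to adjust the learning rate to see results.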
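To make the "try different curve shapes" step concrete, here is a minimal sketch. It assumes synthetic shot data (the distance range and the coefficients 1.0 and -0.15 are made up, not taken from your dataset), bins it into 50 equal-count bins, and fits three candidate shapes to the binned points by least squares. The names `candidates` and `sse` are only illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic shots: an assumption for illustration, not the asker's data.
n = 40_000
distance = rng.uniform(1.0, 40.0, size=n)
p_true = 1.0 / (1.0 + np.exp(-(1.0 - 0.15 * distance)))
goal = rng.binomial(1, p_true)

# Equal-count bins (similar in spirit to pd.qcut with q=50).
order = np.argsort(distance)
bin_dist = np.array([c.mean() for c in np.array_split(distance[order], 50)])
bin_rate = np.array([c.mean() for c in np.array_split(goal[order], 50)])

# Candidate shapes: rate is modelled as a * f(distance) + b.
candidates = {
    "inverse": lambda d: 1.0 / d,
    "log": np.log,
    "square": np.square,
}

sse = {}
for name, f in candidates.items():
    # Ordinary least squares on the transformed distance.
    a, b = np.polyfit(f(bin_dist), bin_rate, deg=1)
    residuals = bin_rate - (a * f(bin_dist) + b)
    sse[name] = float(np.sum(residuals ** 2))
    print(f"{name:>8}: sum of squared errors = {sse[name]:.4f}")
```

Each shape has to be chosen and compared by hand, and none of these fits is constrained to stay between 0 and 1, which the logistic form guarantees by construction.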

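If you are using one feature (your "Distance" predictor), I don't see much difference between the binning approach and logistic regression. However, when you use two or more features (I see "Distance" and "Angle" in the image you provided), how do you plan to combine the per-feature probabilities into a final 0/1 classification? That can be tricky. For one thing, perhaps "Distance" is more useful than "Angle". Logistic regression handles this for you, because it adjusts the weights.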
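As a rough illustration of how logistic regression weighs two features jointly, the sketch below fits a model on synthetic "distance" and "angle" data. The data, the `true_w` coefficients, and the `fit_logistic` helper are all hypothetical, and the fit uses Newton's method instead of gradient descent with a learning rate, purely to keep the example short. In practice you would probably use a library implementation.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic shots: made-up ranges and coefficients, not the asker's data.
n = 40_000
distance = rng.uniform(1.0, 40.0, size=n)
angle = rng.uniform(0.0, 1.5, size=n)      # hypothetical angle in radians
true_w = np.array([0.5, -0.15, 1.0])       # intercept, distance, angle

X = np.column_stack([np.ones(n), distance, angle])
goal = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ true_w)))


def fit_logistic(X, y, n_iter=25):
    """Maximum-likelihood logistic regression via Newton's method."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        gradient = X.T @ (y - p)
        hessian = (X * (p * (1.0 - p))[:, None]).T @ X
        w = w + np.linalg.solve(hessian, gradient)
    return w


weights = fit_logistic(X, goal)
for name, w in zip(["intercept", "distance", "angle"], weights):
    print(f"{name:>9}: {w:+.3f}")
```

A single fit gives one weight per feature, so the relative contribution of distance and angle comes out of the model rather than from combining two separate binned curves by hand.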

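Regarding your binning approach: if you use fewer bins, you might see more bias, since the data may be more complicated than you think, but this is unlikely because your data looks quite simple at first glance. Using more bins should not significantly increase variance, assuming you fit the curve without changing its order. If you change the order of the curve you fit, then yes, that will increase variance. In any case, your data looks like it can be fit very simply with this method.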
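The effect of the number of bins on the individual bin estimates can be checked with a small simulation. This is only a sketch on synthetic data (the distance range and coefficients are assumptions), and `binned_noise` is a hypothetical helper that measures how far each bin's observed scoring rate is from the true average probability in that bin.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic shots with a known true probability (an assumption for illustration).
n = 40_000
distance = rng.uniform(1.0, 40.0, size=n)
p_true = 1.0 / (1.0 + np.exp(-(1.0 - 0.15 * distance)))
goal = rng.binomial(1, p_true)

# Sort once by distance so equal-count bins are contiguous chunks.
order = np.argsort(distance)
goal_sorted = goal[order]
p_sorted = p_true[order]


def binned_noise(n_bins):
    """RMS gap between observed bin rates and true mean bin probabilities."""
    rates = np.array([c.mean() for c in np.array_split(goal_sorted, n_bins)])
    expected = np.array([c.mean() for c in np.array_split(p_sorted, n_bins)])
    return float(np.sqrt(np.mean((rates - expected) ** 2)))


noise = {k: binned_noise(k) for k in (10, 50, 500)}
for k, v in noise.items():
    print(f"{k:>4} bins: RMS deviation from true bin probability = {v:.4f}")
```

Each bin's estimate gets noisier as the bins get smaller, but a curve of fixed order fitted through all the points averages over that noise, which is consistent with the point above that the order of the curve matters more than the bin count.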

python

statistics

machine-learning

bins

logistic-regression