How to calculate adjusted R2 score for non-linear models

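As described in a linked post, the adjusted R2 score can be calculated with the following equation, where n is the number of samples and p is the number of parameters of the model.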

adj_r2 = 1-(1-R2)*(n-1)/(n-p-1)
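
For reference, here is the same equation in standard notation, followed by a worked example. The symbol \bar{R}^2 for the adjusted score and the numbers R^2 = 0.8, n = 100, p = 10 are assumptions chosen only for illustration.

```latex
\bar{R}^2 = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}

% Illustrative values: R^2 = 0.8, n = 100, p = 10
\bar{R}^2 = 1 - (1 - 0.8)\,\frac{99}{89} \approx 0.7775
```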

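According to another post, we can get the number of parameters of our model from model.coef_.

However, for gradient boosting (GBM), it seems we cannot get the number of parameters in the model: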

from sklearn.ensemble import GradientBoostingRegressor
import numpy as np

X = np.random.randn(100, 10)
y = np.random.randn(100)  # 1-D target; a (100, 1) column triggers a DataConversionWarning

model = GradientBoostingRegressor()
model.fit(X,y)

model.coef_

output >>> 
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-3-4650e3f7c16c> in <module>
----> 1 model.coef_

AttributeError: 'GradientBoostingRegressor' object has no attribute 'coef_'

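After checking the documentation, it seems that a GBM is made up of several estimators. Is the number of estimators equal to the number of parameters?

Still, I cannot get the number of parameters of each individual estimator: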

model.estimators_[0][0].coef_


output >>> 
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-2-27216ebb4944> in <module>
----> 1 model.estimators_[0][0].coef_

AttributeError: 'DecisionTreeRegressor' object has no attribute 'coef_'

How can I calculate the adjusted R2 score for a GBM?

Short answer: don't do it (note that all the posts you link to are about linear regression).

Long answer:

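First of all, your definition

p is the number of parameters of the model

is not correct. p is the number of explanatory variables used by the model (source).

By this definition, the answer in the post you linked to actually uses X.shape[1] rather than model.coef_; the latter was suggested in a comment, but it is not correct either (see the comments there).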
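
A minimal sketch of this point, assuming random data with the same shapes as in the question (the seed and the variable names p and has_coef are illustrative): the number of explanatory variables is available from the data itself, and from the fitted model's n_features_in_ attribute in recent scikit-learn versions, whether or not the model exposes coef_.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Illustrative random data: 100 samples, 10 explanatory variables
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 10))
y = rng.standard_normal(100)

model = GradientBoostingRegressor(random_state=0).fit(X, y)

# p is the number of columns of X, independent of the model type
p = X.shape[1]

# The fitted model records the same count; it has no coef_ attribute
has_coef = hasattr(model, "coef_")
print(p, model.n_features_in_, has_coef)
```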

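So, if you insist on calculating an adjusted R-squared for your GBM model, you can always adapt the code from the linked post (after getting your predictions y_pred), also taking advantage of scikit-learn's r2_score: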

from sklearn.metrics import r2_score

y_pred = model.predict(X)
r_squared = r2_score(y, y_pred)
adjusted_r_squared = 1 - (1-r_squared)*(len(y)-1)/(len(y)-X.shape[1]-1)
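
If the calculation is needed in more than one place, it could be wrapped in a small helper. This is only a sketch: the function name adjusted_r2 and its argument names are hypothetical, and the guard against a non-positive denominator is an added assumption, not something taken from the linked post.

```python
def adjusted_r2(r_squared, n_samples, n_features):
    """Adjusted R-squared from a plain R-squared value.

    n_samples is the number of observations (n) and n_features is the
    number of explanatory variables (p), i.e. X.shape[1].
    """
    denominator = n_samples - n_features - 1
    if denominator <= 0:
        # The formula is undefined unless n > p + 1
        raise ValueError("n_samples must be greater than n_features + 1")
    return 1 - (1 - r_squared) * (n_samples - 1) / denominator
```

With the variables from the snippet above, it would be called as adjusted_r2(r2_score(y, y_pred), len(y), X.shape[1]).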

But why shouldn't you do this? Well, quoting from an answer to another question:

the whole R-squared concept comes in fact directly from the world of statistics, where the emphasis is on interpretative models, and it has little use in machine learning contexts, where the emphasis is clearly on predictive models; at least AFAIK, and beyond some very introductory courses, I have never (I mean never...) seen a predictive modeling problem where the R-squared is used for any kind of performance assessment; neither it's an accident that popular machine learning introductions, such as Andrew Ng's Machine Learning at Coursera, do not even bother to mention it.