Yellowbrick: PredictionError dimensionality issue

Question

I am trying to use Yellowbrick's PredictionError and am running into a strange dimensionality problem. I am using yellowbrick version 1.4.

Suppose we have this very simple linear regression:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

from yellowbrick.regressor import PredictionError, ResidualsPlot

X = pd.DataFrame({
    "x1": np.linspace(1, 1000, 800),
    "x2": np.linspace(2, 500, 800),
    "x3": np.random.rand(800) * 50
})
y = pd.DataFrame().assign(y_val = 3 * X.x1 + 4 * X.x2 + X.x3)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)

model = LinearRegression()
model.fit(X_train, y_train)

Now I want to run diagnostics. ResidualsPlot works without trouble when the pandas data structures are passed in unmodified:

rp = ResidualsPlot(model)
rp.fit(X_train, y_train)
rp.score(X_test, y_test)
rp.show()
# produces graphic (not shown)

However, when I try to use PredictionError:

pe = PredictionError(model)
pe.fit(X_train, y_train)
pe.score(X_test, y_test)

The call to score() produces this error message:

File ~/venv/lib/python3.9/site-packages/yellowbrick/bestfit.py:141, in draw_best_fit(X, y, ax, estimator, **kwargs)
    139 # Verify that y is a (n,) dimensional array
    140 if y.ndim > 1:
--> 141     raise YellowbrickValueError(
    142         "y must be a (1,) dimensional array not {}".format(y.shape)
    143     )
    145 # Uses the estimator to fit the data and get the model back.
    146 model = estimator(X, y)

YellowbrickValueError: y must be a (1,) dimensional array not (264, 1)

I now realize that y is a DataFrame. If I change it to a Series, the code works, for example:

# Same as before, for reference
y = pd.DataFrame().assign(y_val= 3 * X.x1 + 4 * X.x2 + X.x3)

# Change to Series here
y = y["y_val"]

Converting to a Series is certainly a viable workaround, but I would like to know why this is required here and not for ResidualsPlot.

Answer 1

PredictionError calls a draw_best_fit function, which checks that y is one-dimensional; that function is not used by ResidualsPlot. Perhaps you could submit a PR proposing a fix. https://github.com/DistrictDataLabs/yellowbrick/
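To make the difference concrete, here is a minimal sketch that uses only pandas and NumPy. The check_1d helper is a hypothetical stand-in that imitates the ndim check visible in the traceback; it is not Yellowbrick's actual code, and the five-row frame is made-up illustration data.

```python
import numpy as np
import pandas as pd


def check_1d(y):
    """Hypothetical stand-in for the ndim check shown in the traceback."""
    y = np.asarray(y)
    if y.ndim > 1:
        raise ValueError(
            "y must be a (1,) dimensional array not {}".format(y.shape)
        )
    return y


# A single-column DataFrame is two-dimensional: shape (n, 1).
y_frame = pd.DataFrame({"y_val": np.arange(5, dtype=float)})

# Selecting the column gives a one-dimensional Series: shape (n,).
y_series = y_frame["y_val"]

# Flattening the underlying array is an equivalent workaround.
y_flat = y_frame.to_numpy().ravel()

# The DataFrame trips the check; the Series and flat array do not.
try:
    check_1d(y_frame)
    frame_passes = True
except ValueError:
    frame_passes = False

print(y_frame.ndim, y_series.ndim, y_flat.ndim)
print(frame_passes)
```

With the data from the question, passing y_train["y_val"] and y_test["y_val"] to fit() and score() should therefore get past the check, which matches what the asker observed after switching to a Series.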

Tags: python, pandas, scikit-learn, yellowbrick