Yellowbrick: PredictionError dimensionality issue
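I'm trying to use the yellowbrick PredictionError visualizer and am running into a strange dimensionality issue. I'm using yellowbrick version 1.4.
Suppose we have this very simple linear regression: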
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from yellowbrick.regressor import PredictionError, ResidualsPlot
X = pd.DataFrame({
    "x1": np.linspace(1, 1000, 800),
    "x2": np.linspace(2, 500, 800),
    "x3": np.random.rand(800) * 50
})
y = pd.DataFrame().assign(y_val = 3 * X.x1 + 4 * X.x2 + X.x3)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)
model = LinearRegression()
model.fit(X_train, y_train)
Now I want to run diagnostics. ResidualsPlot works without trouble when I pass in the Pandas data structures unmodified:
rp = ResidualsPlot(model)
rp.fit(X_train, y_train)
rp.score(X_test, y_test)
rp.show()
# produces graphic (not shown)
However, when I try to use PredictionError:
pe = PredictionError(model)
pe.fit(X_train, y_train)
pe.score(X_test, y_test)
The call to score() produces this error message:
File ~/venv/lib/python3.9/site-packages/yellowbrick/bestfit.py:141, in draw_best_fit(X, y, ax, estimator, **kwargs)
139 # Verify that y is a (n,) dimensional array
140 if y.ndim > 1:
--> 141 raise YellowbrickValueError(
142 "y must be a (1,) dimensional array not {}".format(y.shape)
143 )
145 # Uses the estimator to fit the data and get the model back.
146 model = estimator(X, y)
YellowbrickValueError: y must be a (1,) dimensional array not (264, 1)
I now realize that y is of type DataFrame. If I change it to a Series, the code works, for example:
# Same as before, for reference
y = pd.DataFrame().assign(y_val= 3 * X.x1 + 4 * X.x2 + X.x3)
# Change to Series here
y = y["y_val"]
Converting to a Series is certainly a viable workaround, but I'd like to know why this is required here and not for ResidualsPlot.
PredictionError calls a draw_best_fit function that checks that y is one-dimensional, and ResidualsPlot does not use that function. Perhaps you could submit a PR proposing a fix: https://github.com/DistrictDataLabs/yellowbrick/
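To see why the check fires, it may help to look at the shapes directly. The sketch below uses only pandas and NumPy and does not call yellowbrick. The helper is_one_dimensional is a hypothetical name that mirrors the `y.ndim > 1` test shown in the traceback, and the 8-row data is a small stand-in for the question's 800 rows.

```python
import numpy as np
import pandas as pd

# Small stand-in for the question's data (8 rows instead of 800).
X = pd.DataFrame({
    "x1": np.linspace(1, 1000, 8),
    "x2": np.linspace(2, 500, 8),
})

# A one-column DataFrame is still two-dimensional: shape (n, 1).
y_frame = pd.DataFrame({"y_val": 3 * X.x1 + 4 * X.x2})


def is_one_dimensional(y):
    """Hypothetical helper mirroring the `y.ndim > 1` check in the traceback."""
    return y.ndim == 1


# Three ways to get a 1-D target, shape (n,), from the one-column frame.
y_series = y_frame["y_val"]              # select the column as a Series
y_squeezed = y_frame.squeeze("columns")  # collapse the single column
y_array = y_frame.to_numpy().ravel()     # flatten to a plain NumPy array

print(y_frame.shape, is_one_dimensional(y_frame))
print(y_series.shape, is_one_dimensional(y_series))
print(y_squeezed.shape, is_one_dimensional(y_squeezed))
print(y_array.shape, is_one_dimensional(y_array))
```

Any of the three 1-D forms should pass a check of this kind. Applying the conversion before train_test_split, as in the question's workaround, keeps y_train and y_test one-dimensional as well.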