Plotting a fitted segmented linear model shows more break points than what is estimated

Question

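I was helping a friend with a segmented regression today. We wanted to fit a piecewise regression with a break point and see whether it fits the data better than a standard linear model.

I ran into a problem I cannot explain. When I fit a segmented regression with a single break point to the data provided, the model does indeed estimate a single break point.

However, when you make predictions from the model, they show what looks like two break points. This does not happen when the model is plotted with plot.segmented().

Does anyone know what is going on, and how I can get the correct predictions (and standard errors etc.)? Or is there something wrong in my code in general?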

# load packages
library(segmented)

# make data
d <- data.frame(x = c(0, 3, 13, 18, 19, 19, 26, 26, 33, 40, 49, 51, 53, 67, 70, 88),
                y = c(0, 3.56211608128595, 10.5214485148819, 3.66063708049802, 6.11000808621074, 
                      5.51520423804034, 7.73043895812661, 7.90691392857039, 6.59626527933846, 
                      10.4413913666936, 8.71673928545967, 9.93374157928462, 1.214860139929, 
                      3.32428882257746, 2.65223361387063, 3.25440939462105))

# fit normal linear regression and segmented regression
lm1 <- lm(y ~ x, d)
seg_lm <- segmented(lm1, ~ x)

slope(seg_lm)
#> $x
#>            Est.  St.Err. t value CI(95%).l   CI(95%).u
#> slope1  0.17185 0.094053  1.8271 -0.033079  0.37677000
#> slope2 -0.15753 0.071933 -2.1899 -0.314260 -0.00079718

# make predictions
preds <- data.frame(x = d$x, preds = predict(seg_lm))

# plot segmented fit
plot(seg_lm, res = TRUE)

# plot predictions
lines(preds$preds ~ preds$x, col = 'red')

Created on 2018-07-27 by the reprex package (v0.2.0).

Answer 1

It is purely a plotting issue.

#Call: segmented.lm(obj = lm1, seg.Z = ~x)
#
#Meaningful coefficients of the linear terms:
#(Intercept)            x         U1.x  
#     2.7489       0.1712      -0.3291  
#
#Estimated Break-Point(s):
#psi1.x  
# 37.46

The break point is estimated at x = 37.46, which is not one of the sampling locations:

d$x
# [1]  0  3 13 18 19 19 26 26 33 40 49 51 53 67 70 88

If you draw the plot using the fitted values at those sampling locations,

preds <- data.frame(x = d$x, preds = predict(seg_lm))
lines(preds$preds ~ preds$x, col = 'red')

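you will not see the two fitted segments meet at the break point, because lines simply joins the fitted values one after another with straight lines. Between the sampling locations x = 33 and x = 40 it draws a single straight line that skips the break point, and that extra piece is what looks like a second break point. plot.segmented, by contrast, takes the break point into account and draws the correct plot.

Try the following: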

## the fitted model is piecewise linear between boundary points and break points
xp <- c(min(d$x), seg_lm$psi[, "Est."], max(d$x))
yp <- predict(seg_lm, newdata = data.frame(x = xp))

plot(d, col = 8, pch = 19)  ## observations
lines(xp, yp)  ## fitted model
points(d$x, seg_lm$fitted, pch = 19)  ## fitted values
abline(v = d$x, col = 8, lty = 2)  ## highlight sampling locations
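
To make the size of the effect concrete, here is a minimal base R sketch that rebuilds the fitted mean function by hand. It assumes the rounded coefficients and break point printed above; seg_fun, y_chord and y_model are names invented here for illustration and are not part of the segmented package. It compares the value that lines() effectively draws at the break point (a straight line between the fitted values at x = 33 and x = 40) with the value the model itself takes there.

```r
# Rounded estimates copied from the printed model above (an approximation)
b0  <- 2.7489   # intercept
b1  <- 0.1712   # slope of the first segment
b2  <- -0.3291  # change in slope after the break point (U1.x)
psi <- 37.46    # estimated break point

# Piecewise-linear mean function: b0 + b1 * x + b2 * max(x - psi, 0)
seg_fun <- function(x) b0 + b1 * x + b2 * pmax(x - psi, 0)

# Sampling locations immediately left and right of the break point
x_left  <- 33
x_right <- 40

# Height at psi of the straight line joining the two neighbouring fitted values
y_chord <- approx(c(x_left, x_right), seg_fun(c(x_left, x_right)), xout = psi)$y

# Height of the fitted model itself at psi
y_model <- seg_fun(psi)

c(chord = y_chord, model = y_model)
```

The two numbers differ, so the red line cuts the corner of the fitted model. Adding the break point to the x values used for plotting, as in the code above, removes the discrepancy.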

Answer 2

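I am not familiar with the software you use, so I cannot answer specifically. Nevertheless, I tried with my own (home-made) software and got this:

Case of two connected segments:

This seems consistent with your result.

Case of two unconnected segments:

Case of three connected segments: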

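We observe that the mean square error (MSE) is lowest in the case of two unconnected segments, which is not surprising with such a large scatter.

The case of three connected segments is interesting. The result is intermediate between the other two. The added segment forms an almost vertical link between the other two segments.

Well, this does not explain the strange result from the software you use. I do not know why that software does not find the three segments of minimum MSE.

The prediction you get (two large segments linked by a very small one) gives exactly the same MSE as the fit without the small segment, because no experimental point falls on the small segment. An infinite number of equivalent solutions can be found by adding small "dummy" segments, as long as no experimental point is associated with them.

This is shown in the figure below, where the "branching zone" is magnified to make it easier to understand.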

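The 2-segment solution is (AC)+(CB).

A first 3-segment solution is (AD)+(DE)+(EB).

Another 3-segment solution is (AF)+(FG)+(GB).

Another 3-segment solution is (AH)+(HI)+(IB).

One can imagine many others...

All of these solutions have the same MSE. So, from a statistical point of view, with the mean square error as the criterion, they can be considered equivalent.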
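
The equivalence can be checked numerically with a short base R sketch. It assumes the data from the question and the rounded estimates printed in Answer 1; the knot positions 35 and 39 are arbitrary choices made here for illustration, and two_seg and three_seg are hypothetical helper names. The three-segment curve follows the two-segment fit everywhere except between the knots, where a short "dummy" segment cuts the corner.

```r
# Data from the question
d <- data.frame(
  x = c(0, 3, 13, 18, 19, 19, 26, 26, 33, 40, 49, 51, 53, 67, 70, 88),
  y = c(0, 3.56211608128595, 10.5214485148819, 3.66063708049802,
        6.11000808621074, 5.51520423804034, 7.73043895812661,
        7.90691392857039, 6.59626527933846, 10.4413913666936,
        8.71673928545967, 9.93374157928462, 1.214860139929,
        3.32428882257746, 2.65223361387063, 3.25440939462105)
)

# Rounded estimates from the printed model (an approximation)
b0 <- 2.7489; b1 <- 0.1712; b2 <- -0.3291; psi <- 37.46

# Two connected segments meeting at psi
two_seg <- function(x) b0 + b1 * x + b2 * pmax(x - psi, 0)

# Three connected segments: leave the first line at x = 35 and join the
# second line at x = 39; no observation lies between these two knots
knots_x   <- c(min(d$x), 35, 39, max(d$x))
three_seg <- approxfun(knots_x, two_seg(knots_x))

# Mean square error of each curve at the observed x values
mse_two   <- mean((d$y - two_seg(d$x))^2)
mse_three <- mean((d$y - three_seg(d$x))^2)

all.equal(mse_two, mse_three)       # same MSE at the observed points
c(two_seg(psi), three_seg(psi))     # but the curves differ at the break point
```

Any other pair of knots inside the gap between x = 33 and x = 40 would work just as well, which is why the MSE alone cannot distinguish these solutions.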
