难以在R中拟合分段线性数据

Question

我有如下数据（产品成本与时间），如下所示：

annum <- c(1903, 1904, 1905, 1906, 1907, 1908, 1909, 1910, 1911, 1912, 1913, 
    1914, 1915, 1916, 1917, 1918, 1919)
cost <- c(0.0000,  18.6140,  92.1278, 101.9393, 112.0808, 122.5521, 
    133.3532, 144.4843, 244.5052, 275.6068, 295.2592, 317.3145, 
    339.6527, 362.3537, 377.7775, 402.8443, 437.5539)

mydata <- as.data.frame(cbind(annum, cost))

g <- ggplot(mydata, aes(x = annum, y = cost))
g <- g + geom_point()
g <- g + scale_y_continuous(labels=scales::dollar_format())
g

This is the resulting plot of this data using this code 该图显示了一些对我来说看起来是分段线性的东西；从 1904 到 1905 有一步；然后是从 1905 年到 1910 年的清晰界线；然后一步；然后是从 1911 年到结尾的另一行。 (第一个点(1903, 0)是虚构的。)

我尝试使用分段包对此进行建模，但它没有选择 1904.5 和 1910.5 之类的东西作为断点，而是在 1911 和 1912 之间找到了两个点。

我已经尝试了一些其他技术（例如，“The R Book”中的“蛮力”和直接拟合），但我显然没有充分理解这一点。非常感谢任何帮助。

理想情况下，我最终会得到每个段的方程式和一个显示分段拟合和拟合置信区间的图。

Answer 1

可以为此使用软件包 struccchange。这里有一个简化的代码版本：

library("strucchange")

startyear <- startyear
cost <- c(0.0000,  18.6140,  92.1278, 101.9393, 112.0808, 122.5521, 
          133.3532, 144.4843, 244.5052, 275.6068, 295.2592, 317.3145, 
          339.6527, 362.3537, 377.7775, 402.8443, 437.5539)

ts <- ts(cost, start=1903)
plot(ts)

## for small data sets you might consider to reduce segment length
bp <- breakpoints(ts ~ time(ts), data=ts, h = 5)

## BIC selection of breakpoints
plot(bp)
breakdates(bp)
fm1 <- lm(ts ~ time(ts) * breakfactor(bp), data=ts)
coef(fm1)

plot(ts, type="p")
lines(ts(fitted(fm1),  start = startyear),  col = 4)
lines(bp)
confint(bp)

lines(confint(bp))

更多信息可以在包 vignette 或相关出版物之一中找到，例如https://doi.org/10.18637/jss.v007.i02 因此，例如可以进行显着性检验、估计置信区间或包括协变量。

段长度为 2 是不可能的，因为无法估计残差。同样，只有在段足够长的情况下才能估计置信区间。因此，下面只显示一个断点，而@Rui Barradas 的优秀答案省略了置信区间，但显示了两个断点。

她的例子没有前两点和一个额外的假设来估计小段情况下的置信区间：

library("strucchange")

startyear <- 1905
cost <- c(92.1278, 101.9393, 112.0808, 122.5521, 
          133.3532, 144.4843, 244.5052, 275.6068, 295.2592, 317.3145, 
          339.6527, 362.3537, 377.7775, 402.8443, 437.5539)

ts <- ts(cost, start=startyear)
bp <- breakpoints(ts ~ time(ts), data=ts, h = 5)
fm1 <- lm(ts ~ time(ts) * breakfactor(bp), data=ts)
plot(ts, type="p")
lines(ts(fitted(fm1),  start = startyear),  col = 4)
lines(confint(bp, het.err=FALSE))

编辑：

修复了原始版本的错误
添加了系数和置信区间
已添加图像
添加了省略的前 2 个值的示例

Answer 2

这是另一个解决方案 strucchange，但没有先创建时间序列。

library(strucchange)

# first get a segment size as a fraction 
# of the number of observations
n <- nrow(mydata)
segmts <- 3
h <- (segmts + 1)/n

# now estimate the breakpoints
b <- breakpoints(cost ~ annum, h = h, breaks = (segmts - 1L), data = mydata)
bp <- mydata[b$breakpoints, "annum"]

# create a grouping variable for `ggplot`
# each group is a segment
bp <- c(bp, Inf)
mydata$grp <- findInterval(mydata$annum, bp, left.open = TRUE)

# plot the linear regressions
g + geom_smooth(
  mapping = aes(group = grp),
  method = "lm",
  formula = y ~ x,
  se = FALSE
)

如果删除第一个数据点，将只有两个段，但上面的代码仍然有效。

mydata <- mydata[-(1:2), ]
n <- nrow(mydata)
segmts <- 2
h <- (segmts + 1)/n
b <- breakpoints(cost ~ annum, h = h, breaks = segmts - 1L, data = mydata)
bp <- mydata[b$breakpoints, "annum"]
bp <- c(bp, Inf)
mydata$grp <- findInterval(mydata$annum, bp, left.open = TRUE)
mydata$grp <- factor(mydata$grp)

g + geom_smooth(
  mapping = aes(group = grp),
  method = "lm",
  formula = y ~ x,
  se = FALSE
)

Answer 3

变点问题的置信区间对于频率论方法来说是一个难题，例如strucchange。通常，您只是获得每个段的置信区间，即段之间的硬中断而不是平滑过渡。

使用贝叶斯方法更直接。这是使用 mcp 包的解决方案。为了炫耀，我们绘制了拟合区间和（红色虚线）和预测区间（绿色虚线）。灰线是从后验分布中随机抽取的，x 轴上的密度是变化点位置的后验分布。

data = data.frame(
  annum = 1903:1919,
  cost = c(0.0000,  18.6140,  92.1278, 101.9393, 112.0808, 122.5521, 
          133.3532, 144.4843, 244.5052, 275.6068, 295.2592, 317.3145, 
          339.6527, 362.3537, 377.7775, 402.8443, 437.5539)
)

# Model as three disjoined slopes
model = list(
  cost ~ 1 + annum,
  ~ 1 + annum,
  ~ 1 + annum
)

library(mcp)
fit = mcp(model, data)
plot(fit, q_fit = TRUE, q_predict = TRUE)

如果您对变化点和段的参数估计感兴趣，只需调用 summary(fit):

        name    mean  lower    upper Rhat n.eff
     annum_1   -0.11   -0.2 -6.6e-04  2.5    25
     annum_2   10.36    7.4  1.3e+01  1.0   609
     annum_3   22.74   21.2  2.4e+01  1.0   264
        cp_1 1904.50 1904.0  1.9e+03  2.5    24
        cp_2 1910.46 1910.0  1.9e+03  1.0   778
 Intercept_1  221.39   10.8  3.9e+02  1.0   948
 Intercept_2   86.77   75.0  9.8e+01  1.0  1297
 Intercept_3  236.03  221.7  2.5e+02  1.0   237
     sigma_1    5.97    3.6  8.9e+00  1.0  1709

难以在R中拟合分段线性数据

Difficulty fitting piecewise linear data in R

regression

r

linear-regression

data-fitting

piecewise