将数据分成两个单独的组 s.t。最小化具有一个连续预测变量的残差平方和

Partition data into two separate groups s.t. residual sum of squares with one continuous predictor is minimized

将一组数据分成两组的基本算法是什么s.t。两个单独的残差平方和之和最小化?例如,考虑下面的代码。基本上,如何在不迭代测试每个可能值的情况下计算存储在 best.cutpoint$RSS 中的值?

set.seed(1)
ind.var <- runif(1000, 1, 50000)
dep.var <- ind.var * runif(1000, 2, 3) + rnorm(1000, 100, 500)

dat <- data.frame(ind.var, dep.var)

best.cutpoint <- list(RSS = Inf, cutpoint = NA)
for(cutpoint in sort(unique(ind.var))){
    # partition data
    dat1 <- dat[dat$ind.var > cutpoint,]
    dat2 <- dat[!(dat$ind.var > cutpoint),]

    if(nrow(dat1) < 2 | nrow(dat2) < 2){
        next
    }
    # estimate
    mod1 <- lm(dep.var ~ ind.var, dat = dat1)
    mod2 <- lm(dep.var ~ ind.var, dat = dat2)

    # calculate RSS
    part1.RSS <- sum((dat1$dep.var - (mod1$coefficients['(Intercept)'] + dat1$ind.var * mod1$coefficients['ind.var'])) ^ 2)
    part2.RSS <- sum((dat2$dep.var - (mod2$coefficients['(Intercept)'] + dat2$ind.var * mod2$coefficients['ind.var'])) ^ 2)

    total <- part1.RSS + part2.RSS

    if(total < best.cutpoint$RSS){
        best.cutpoint <- list(RSS = total, cutpoint = cutpoint)
    }
}

从以下可能值范围生成以下结果。

> print(best.cutpoint)
$RSS
[1] 75241532557

$cutpoint
[1] 34351.46

> range(dat$ind.var)
[1]    66.73151 49996.52975

我觉得你在问如何确定 segmented or piecewise linear regression 的断点。如果情况并非如此,请告诉我。

该软件包可用于此目的Segmented

首先让我们生成一些数据:

x<-seq(1:20)
y<-c(seq(1:10),seq(10,100,by=10))
plot(x,y)

这个数据看起来像,

很明显 "breakpoint" 在哪里。

接下来,让我们用分段包拟合模型:

library(segmented)
lin.mod <- lm(y~x)
segmented.mod <- segmented(lin.mod, seg.Z = ~x, psi=14)

找到断点了吗?

plot(segmented.mod)
points(x,y)

看起来确实如此。

> segmented.mod
Call: segmented.lm(obj = lin.mod, seg.Z = ~x, psi = 14)

Meaningful coefficients of the linear terms:
(Intercept)            x         U1.x  
     0.1818       0.9545       9.0455  

Estimated Break-Point(s) psi1.x : 11.08 

其中 seg.z 和 psi 定义为:

seg.Z a formula with no response variable, such as seg.Z=~x1+x2, indicating the
(continuous) explanatory variables having segmented relationships with the response.
Currently, formulas involving functions, such as seg.Z=~log(x1) or
seg.Z=~sqrt(x1), or selection operators, such as seg.Z=~d[,"x1"] or seg.Z=~d$x1,
are not allowed.
psi named list of vectors. The names have to match the variables of the seg.Z
argument. Each vector includes starting values for the break-point(s) for the
corresponding variable in seg.Z. If seg.Z includes only a variable, psi may be
a numeric vector. A NA value means that ‘K’ quantiles (or equally spaced values)
are used as starting values; K is fixed via the seg.control auxiliary function.