将数据分成两个单独的组 s.t。最小化具有一个连续预测变量的残差平方和
Partition data into two separate groups s.t. residual sum of squares with one continuous predictor is minimized
将一组数据分成两组的基本算法是什么s.t。两个单独的残差平方和之和最小化?例如,考虑下面的代码。基本上,如何在不迭代测试每个可能值的情况下计算存储在 best.cutpoint$RSS
中的值?
set.seed(1)
ind.var <- runif(1000, 1, 50000)
dep.var <- ind.var * runif(1000, 2, 3) + rnorm(1000, 100, 500)
dat <- data.frame(ind.var, dep.var)
best.cutpoint <- list(RSS = Inf, cutpoint = NA)
for(cutpoint in sort(unique(ind.var))){
# partition data
dat1 <- dat[dat$ind.var > cutpoint,]
dat2 <- dat[!(dat$ind.var > cutpoint),]
if(nrow(dat1) < 2 | nrow(dat2) < 2){
next
}
# estimate
mod1 <- lm(dep.var ~ ind.var, dat = dat1)
mod2 <- lm(dep.var ~ ind.var, dat = dat2)
# calculate RSS
part1.RSS <- sum((dat1$dep.var - (mod1$coefficients['(Intercept)'] + dat1$ind.var * mod1$coefficients['ind.var'])) ^ 2)
part2.RSS <- sum((dat2$dep.var - (mod2$coefficients['(Intercept)'] + dat2$ind.var * mod2$coefficients['ind.var'])) ^ 2)
total <- part1.RSS + part2.RSS
if(total < best.cutpoint$RSS){
best.cutpoint <- list(RSS = total, cutpoint = cutpoint)
}
}
从以下可能值范围生成以下结果。
> print(best.cutpoint)
$RSS
[1] 75241532557
$cutpoint
[1] 34351.46
> range(dat$ind.var)
[1] 66.73151 49996.52975
我觉得你在问如何确定 segmented or piecewise linear regression 的断点。如果情况并非如此,请告诉我。
该软件包可用于此目的Segmented
首先让我们生成一些数据:
x<-seq(1:20)
y<-c(seq(1:10),seq(10,100,by=10))
plot(x,y)
这个数据看起来像,
很明显 "breakpoint" 在哪里。
接下来,让我们用分段包拟合模型:
library(segmented)
lin.mod <- lm(y~x)
segmented.mod <- segmented(lin.mod, seg.Z = ~x, psi=14)
找到断点了吗?
plot(segmented.mod)
points(x,y)
看起来确实如此。
> segmented.mod
Call: segmented.lm(obj = lin.mod, seg.Z = ~x, psi = 14)
Meaningful coefficients of the linear terms:
(Intercept) x U1.x
0.1818 0.9545 9.0455
Estimated Break-Point(s) psi1.x : 11.08
其中 seg.z 和 psi 定义为:
seg.Z a formula with no response variable, such as seg.Z=~x1+x2, indicating the
(continuous) explanatory variables having segmented relationships with the response.
Currently, formulas involving functions, such as seg.Z=~log(x1) or
seg.Z=~sqrt(x1), or selection operators, such as seg.Z=~d[,"x1"] or seg.Z=~d$x1,
are not allowed.
psi named list of vectors. The names have to match the variables of the seg.Z
argument. Each vector includes starting values for the break-point(s) for the
corresponding variable in seg.Z. If seg.Z includes only a variable, psi may be
a numeric vector. A NA value means that ‘K’ quantiles (or equally spaced values)
are used as starting values; K is fixed via the seg.control auxiliary function.
将一组数据分成两组的基本算法是什么s.t。两个单独的残差平方和之和最小化?例如,考虑下面的代码。基本上,如何在不迭代测试每个可能值的情况下计算存储在 best.cutpoint$RSS
中的值?
set.seed(1)
ind.var <- runif(1000, 1, 50000)
dep.var <- ind.var * runif(1000, 2, 3) + rnorm(1000, 100, 500)
dat <- data.frame(ind.var, dep.var)
best.cutpoint <- list(RSS = Inf, cutpoint = NA)
for(cutpoint in sort(unique(ind.var))){
# partition data
dat1 <- dat[dat$ind.var > cutpoint,]
dat2 <- dat[!(dat$ind.var > cutpoint),]
if(nrow(dat1) < 2 | nrow(dat2) < 2){
next
}
# estimate
mod1 <- lm(dep.var ~ ind.var, dat = dat1)
mod2 <- lm(dep.var ~ ind.var, dat = dat2)
# calculate RSS
part1.RSS <- sum((dat1$dep.var - (mod1$coefficients['(Intercept)'] + dat1$ind.var * mod1$coefficients['ind.var'])) ^ 2)
part2.RSS <- sum((dat2$dep.var - (mod2$coefficients['(Intercept)'] + dat2$ind.var * mod2$coefficients['ind.var'])) ^ 2)
total <- part1.RSS + part2.RSS
if(total < best.cutpoint$RSS){
best.cutpoint <- list(RSS = total, cutpoint = cutpoint)
}
}
从以下可能值范围生成以下结果。
> print(best.cutpoint)
$RSS
[1] 75241532557
$cutpoint
[1] 34351.46
> range(dat$ind.var)
[1] 66.73151 49996.52975
我觉得你在问如何确定 segmented or piecewise linear regression 的断点。如果情况并非如此,请告诉我。
该软件包可用于此目的Segmented
首先让我们生成一些数据:
x<-seq(1:20)
y<-c(seq(1:10),seq(10,100,by=10))
plot(x,y)
这个数据看起来像,
很明显 "breakpoint" 在哪里。
接下来,让我们用分段包拟合模型:
library(segmented)
lin.mod <- lm(y~x)
segmented.mod <- segmented(lin.mod, seg.Z = ~x, psi=14)
找到断点了吗?
plot(segmented.mod)
points(x,y)
看起来确实如此。
> segmented.mod
Call: segmented.lm(obj = lin.mod, seg.Z = ~x, psi = 14)
Meaningful coefficients of the linear terms:
(Intercept) x U1.x
0.1818 0.9545 9.0455
Estimated Break-Point(s) psi1.x : 11.08
其中 seg.z 和 psi 定义为:
seg.Z a formula with no response variable, such as seg.Z=~x1+x2, indicating the
(continuous) explanatory variables having segmented relationships with the response.
Currently, formulas involving functions, such as seg.Z=~log(x1) or
seg.Z=~sqrt(x1), or selection operators, such as seg.Z=~d[,"x1"] or seg.Z=~d$x1,
are not allowed.
psi named list of vectors. The names have to match the variables of the seg.Z
argument. Each vector includes starting values for the break-point(s) for the
corresponding variable in seg.Z. If seg.Z includes only a variable, psi may be
a numeric vector. A NA value means that ‘K’ quantiles (or equally spaced values)
are used as starting values; K is fixed via the seg.control auxiliary function.