如何从线性拟合中提取线数和相应的方程式

How can I extract the number of lines and the corresponding equations from a linear fit

我有数据,我希望有几个

形式的线性相关
y_i = a_i + b_i * t_i,      i = 1 .. N

其中 N 是先验未知的。问题的简短版本是:Given a fit

在下面的可重现示例中,我有数据 (t,y) 以及相应的参数 p1(级别 p1_1p1_2)和 p2(级别 p2_1p2_2p2_3)。 因此,数据看起来像 (t, y, p1, p2),其中最多 2*3 条不同的最佳拟合线,而来自的线性拟合最多 2*2*3个非零系数。

我运行进入以下问题:假设我有三个方程

y1 = 5 + 3*t (for p1=p1_1, p2=p2_1)
y2 = 3 + t   (for p1=p1_2, p2=p2_2)
y3 = 1 – t   (for p1=p1_2, p2=p2_3)

运行 cv.glmnet(y ~ t * p1 * p2, ...) 产量

(Intercept)        5
t                  3 => y1 = 5 + 3t
p1p1_2            -2 => y2 = 3 + 3t?
p2p2_2             .
p2p2_3            -2 => y3 = 1 + 3t?
t:p1p1_2          -2 => y4 = 3 + t (or y4 = 1 + t?)
t:p2p2_2           .
t:p2p2_3          -2 => y5 = 1 - t
p1p1_2:p2p2_2      .
p1p1_2:p2p2_3   -0.1 => y6 = 0.9 – t?
t:p1p1_2:p2p2_2    . 
t:p1p1_2:p2p2_3    .

期望的结果:程序应该建议 4 个方程 y1,正确的 y4、y5 和 y6,希望有一个很好的理由(哪个?)忽略 y6。

运行 lm(y ~ t * p1 * p2) 产量

(Intercept)        5
t                  3 => y1 = 5 + 3t
p1p1_2            -4 => y2 = 1 + 3t?
p2p2_2             2 => y3 = 3 + 3t
p2p2_3             .
t:p1p1_2          -4 => y5 = 1 - x (or y4 = 3 - t?)
t:p2p2_2           2 => y6 = 3 + t?
t:p2p2_3           .
p1p1_2:p2p2_2      .
p1p1_2:p2p2_3      .
t:p1p1_2:p2p2_2    .
t:p1p1_2:p2p2_3    .

想要的结果:程序应该建议 3 个方程 y1、y3 和 y6

我是否忽略了一些明显的东西?

可重现的例子

第三列是包含噪声的虚拟因子。为简单起见,不考虑此列

# Create testdata
sigma <- 0.5
t <- seq(0,10, length.out = 1000) # large sample of x values
# Create 3 linear equations of the form y_i = a*t_i + b
a <- c(3, 1, -1) # slope
b <- c(5, 3, 1) # offset

# create t_i, y_ti (theory) and y_i (including noise)
d <- list()
y <- list()
y_t <- list()
for (i in 1:3) {
  set.seed(33*i)
  d[[i]] <- sort(sample(t, 50, replace = F))
  set.seed(33*i)
  noise <- rnorm(10, 0, sigma)
  y[[i]] <- a[i]*d[[i]] + b[i] + noise
  y_t[[i]] <- a[i]*d[[i]] + b[i]
}
# Final data set
df1 <- data.frame(t=d[[1]], y=y[[1]], p1=rep("p1_1"), p2=rep("p2_1"), 
                  p3=sample(c("p3_1", "p3_2", "p3_3"), length(d[[1]]), replace = T))
df2 <- data.frame(t=d[[2]], y=y[[2]], p1=rep("p1_2"), p2=rep("p2_2"), 
                  p3=sample(c("p3_1", "p3_2", "p3_3"), length(d[[1]]), replace = T))
df3 <- data.frame(t=d[[3]], y=y[[3]], p1=rep("p1_2"), p2=rep("p2_3"), 
                  p3=sample(c("p3_1", "p3_2", "p3_3"), length(d[[1]]), replace = T))
mydata <- rbind(df1, df2, df3)
mydata$p1 <- factor(mydata$p1)
mydata$p2 <- factor(mydata$p2)
mydata$p3 <- factor(mydata$p3)
mydata <- mydata[sample(nrow(mydata)), ]

# What the raw data looks like:
plot(x = mydata$t, y = mydata$y)

cols <- rainbow(length(levels(mydata$p1))*length(levels(mydata$p2))*length(levels(mydata$p3)))
rm(.Random.seed, envir=.GlobalEnv)
cols <- sample(cols) # most likely similar colors are not next to each other;-)

# Fit using lm disabled - just uncomment and comment the part below
# fit <- lm(y ~ t * p1 * p2, data = mydata)
# coef <- as.matrix(fit$coefficients)
# mydata$pred <- predict(fit)

# Fit using glmnet
set.seed(42)
fit_type <- c("lambda.min", "lambda.1se")[1]
x <- model.matrix(y ~ t * p1 * p2, data = mydata)[,-1]
fit <- glmnet::cv.glmnet(x, mydata$y, intercept = TRUE, nfolds = 10, alpha = 1)
coef <- glmnet::coef.cv.glmnet(fit, newx = x, s = fit_type)
mydata$pred <- predict(fit, newx = x, s = fit_type)

# plots
plot(d[[1]], y_t[[1]], type = "l", lty = 3, col = "black", main = "Raw data", 
     xlim = c(0, 10), ylim = c(min(mydata$y), max(mydata$y)), xlab = "t", ylab = "y")
lines(d[[2]], y_t[[2]], col = "black", lty = 3)
lines(d[[3]], y_t[[3]], col = "black", lty = 3)

# The following for loops are fixed right now. In the end this should be automated using 
# the input from the fit (and the knowledge how to extract N and the lines above).
pn <- 0
for (p1 in 1:length(levels(mydata$p1))) {
  for (p2 in 1:length(levels(mydata$p2))) {
    pn <- pn + 1
    tmp <- mydata[mydata$p1 == levels(mydata$p1)[p1] & mydata$p2 == levels(mydata$p2)[p2], ]
    points(x = tmp$t, y = tmp$y, col = cols[pn]) # original data
    points(x = tmp$t, y = tmp$pred, col = cols[pn], pch = 3) # estimated data from predict
    if (length(tmp$pred) > 0) {
      abline(lm(tmp$pred ~ tmp$t), col = cols[pn])
    }
  }
}

相关帖子:

我认为您误解了回归结果。如果方程包含项 p1_mp2_n,则它还必须包含交互作用 t:p1_mt:p2_n。它不能是一个而不是另一个。样本数据中有三对系数:

> unique(mydata[,3:4])
#       p1   p2
# 96  p1_2 p2_2
# 1   p1_1 p2_1
# 135 p1_2 p2_3

查看 lm 结果,我们将方程重构为:

  1. y = 5 + 3t + p1p1_2 + (t:p1p1_2)*t + p2p2_2 + (t:p2p2_2)*t = 3 + t;
  2. y = 5 + 3t + p1p1_1 + (t:p1p1_1)*t + p2p2_1 + (t:p2p2_1)*t = 5 + 3t;
  3. y = 5 + 3t + p1p1_2 + (t:p1p1_2)*t + p2p2_3 + (t:p2p2_3)*t = 1 - t.

这些与您在开头指定的方程式相匹配,因此没有歧义。