Leave-one-out cross-validation by leaving out two IDs during the training process

Question

I have a data frame df:

df<-structure(list(ID = c(4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 
6, 6, 6, 6, 8, 8, 8, 9, 9), Y = c(2268.14043972082, 2147.62290922552, 
2269.1387550775, 2247.31983098201, 1903.39138268307, 2174.78291538358, 
2359.51909126411, 2488.39004804939, 212.851575751527, 461.398994384333, 
567.150629704352, 781.775113821961, 918.303706148872, 1107.37695799186, 
1160.80594193377, 1412.61328924168, 1689.48879626486, 685.154353165934, 
574.088067465695, 650.30821636616, 494.185166497016, 436.312162090908
), P = c(1750.51986303926, 1614.11541634798, 951.847023338079, 
1119.3682884872, 1112.38984390156, 1270.65773075982, 1234.72262170166, 
1338.46096616983, 1198.95775346458, 1136.69287367165, 1265.46480803983, 
1364.70149818063, 1112.37006707489, 1346.49240261316, 1740.56677791104, 
1410.99217295647, 1693.18871380948, 275.447173420805, 396.449789014179, 
251.609239829704, 215.432550271042, 55.5336257666349), A = c(49, 
50, 51, 52, 53, 54, 55, 56, 1, 2, 3, 4, 5, 14, 15, 16, 17, 163, 
164, 165, 153, 154), TA = c(9.10006221322572, 7.65505467142961, 
8.21480062559674, 8.09251754304318, 8.466220758789, 8.48094407814006, 
8.77304120569444, 8.31727518543397, 8.14410265791868, 8.80921738865237, 
9.04091478341757, 9.66233618146246, 8.77015716015164, 9.46037931956657, 
9.59702379240667, 10.1739258740118, 9.39524442215692, -0.00568604734662462, 
-2.12940164413048, -0.428603434930109, 1.52337963973006, -1.04714984064565
), TS = c(9.6499861763085, 7.00622420539595, 7.73511170298675, 
7.68006974050443, 8.07442411510912, 8.27687965909096, 8.76025039592727, 
8.3345638889156, 9.23658956753677, 8.98160722605782, 8.98234210211611, 
9.57066566368204, 8.74444401914267, 8.98719629775988, 9.18169205278566, 
9.98225438314085, 9.56196773059615, 5.47788158053928, 2.58106090926808, 
3.22420704848299, 1.36953555753786, 0.241334267522977), R = c(11.6679680423377, 
11.0166459173372, 11.1851268491296, 10.7404563561694, 12.1054055597684, 
10.9551321815546, 11.1975918244469, 10.7242192465965, 10.1661703705992, 
11.4840412725324, 11.1248456370953, 11.2529612597628, 10.7694642397996, 
12.3300887767583, 12.0478558531771, 12.3212362249214, 11.5650773932264, 
9.56070414783612, 9.61762902218185, 10.2076240621201, 11.8234628013552, 
10.9184029778985)), .Names = c("ID", "Y", "P", "A", "TA", "TS", 
"R"), na.action = structure(77:78, .Names = c("77", "78"), class = "omit"), row.names = c(NA, 
22L), class = "data.frame")

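I am currently fitting a linear regression in leave-one-out cross-validation mode. In other words, on each iteration I drop one site (one ID) from the training data and test the model on the site that was left out. The code is below: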

df$prediction <- NA
for(id in unique(df$ID)){
  train.df <- df[df$ID != id,]
  test.df <- df[df$ID == id, c("P", "A", "TA", "TS","R")]
  lm.df<- glm(Y ~ P+A+TA+TS+R, data=train.df)
  step.df<- step(lm.df, direction = "backward")
  df.pred = predict(object = step.df, newdata = test.df)
  df$prediction[df$ID== id] <- df.pred
}

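However, I would like to leave out 2 IDs on each iteration of the cross-validation instead of one, so that my test set contains two IDs each time rather than one. Does anyone know how I can do this?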

Answer 1

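If you change == to %in% (and != to a negated %in%) and replace unique(df$ID) with split(unique(df$ID), c(1,1,2,2,3)), it seems to work. Essentially, on each iteration you pass two IDs instead of one, so test.df contains the rows for both of them.

See this: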

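df$prediction <- NA
# Loop over groups of IDs (two per group) instead of single IDs
for(id in split(unique(df$ID), c(1,1,2,2,3))){
  print(id)
  # Train on every row whose ID is not in the held-out group
  train.df <- df[!df$ID %in% id,]
  # Test on the rows of the held-out IDs (predictors only)
  test.df <- df[df$ID %in% id, c("P", "A", "TA", "TS","R")]
  lm.df<- glm(Y ~ P+A+TA+TS+R, data=train.df)
  # trace=0 silences the step-by-step selection log
  step.df<- step(lm.df, direction = "backward",trace=0)
  df.pred = predict(object = step.df, newdata = test.df)
  df$prediction[df$ID %in% id] <- df.pred
}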

Output:

[1] 4 5
[1] 6 8
[1] 9

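I set trace to 0 above so that only the IDs passed in the loop are printed. As you can see, each iteration now gets two IDs instead of one (except, obviously, the last one, because there are five IDs). split divides the vector unique(df$ID) according to the grouping vector c(1,1,2,2,3), which yields chunks of two IDs that we can then loop over.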
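The grouping vector c(1,1,2,2,3) is hard-coded for exactly five IDs. As a rough sketch of how it could be generalized, the grouping can be computed from the number of unique IDs. Note that make_id_folds and its size argument are hypothetical names introduced here for illustration; they are not part of the original answer.

```r
# Hypothetical helper (not from the original answer): group the unique IDs
# into consecutive chunks of `size` IDs each; the last chunk may be shorter.
make_id_folds <- function(ids, size = 2) {
  ids <- unique(ids)
  # ceiling(seq_along(ids) / size) gives 1,1,2,2,3,... when size = 2
  split(ids, ceiling(seq_along(ids) / size))
}

# Small illustrative ID vector with the same unique IDs as df$ID
folds <- make_id_folds(c(4, 4, 5, 6, 6, 8, 9), size = 2)
print(folds)
```

In the loop above this would be used as for(id in make_id_folds(df$ID)). If every possible pair of IDs should be held out instead of disjoint chunks, combn(unique(df$ID), 2, simplify = FALSE) returns a list of all pairs; in that case each ID is predicted several times, so the predictions would presumably need to be stored per pair rather than written into a single df$prediction column.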

r

cross-validation