Compute Random Forest with leave-one-ID-out cross-validation
I have a data frame df:
dput(df)
structure(list(ID = c(4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5,
6, 6, 6, 6, 8, 8, 8, 9, 9), Y = c(2268.14043972082, 2147.62290922552,
2269.1387550775, 2247.31983098201, 1903.39138268307, 2174.78291538358,
2359.51909126411, 2488.39004804939, 212.851575751527, 461.398994384333,
567.150629704352, 781.775113821961, 918.303706148872, 1107.37695799186,
1160.80594193377, 1412.61328924168, 1689.48879626486, 685.154353165934,
574.088067465695, 650.30821636616, 494.185166497016, 436.312162090908
), P = c(1750.51986303926, 1614.11541634798, 951.847023338079,
1119.3682884872, 1112.38984390156, 1270.65773075982, 1234.72262170166,
1338.46096616983, 1198.95775346458, 1136.69287367165, 1265.46480803983,
1364.70149818063, 1112.37006707489, 1346.49240261316, 1740.56677791104,
1410.99217295647, 1693.18871380948, 275.447173420805, 396.449789014179,
251.609239829704, 215.432550271042, 55.5336257666349), A = c(49,
50, 51, 52, 53, 54, 55, 56, 1, 2, 3, 4, 5, 14, 15, 16, 17, 163,
164, 165, 153, 154), TA = c(9.10006221322572, 7.65505467142961,
8.21480062559674, 8.09251754304318, 8.466220758789, 8.48094407814006,
8.77304120569444, 8.31727518543397, 8.14410265791868, 8.80921738865237,
9.04091478341757, 9.66233618146246, 8.77015716015164, 9.46037931956657,
9.59702379240667, 10.1739258740118, 9.39524442215692, -0.00568604734662462,
-2.12940164413048, -0.428603434930109, 1.52337963973006, -1.04714984064565
), TS = c(9.6499861763085, 7.00622420539595, 7.73511170298675,
7.68006974050443, 8.07442411510912, 8.27687965909096, 8.76025039592727,
8.3345638889156, 9.23658956753677, 8.98160722605782, 8.98234210211611,
9.57066566368204, 8.74444401914267, 8.98719629775988, 9.18169205278566,
9.98225438314085, 9.56196773059615, 5.47788158053928, 2.58106090926808,
3.22420704848299, 1.36953555753786, 0.241334267522977), R = c(11.6679680423377,
11.0166459173372, 11.1851268491296, 10.7404563561694, 12.1054055597684,
10.9551321815546, 11.1975918244469, 10.7242192465965, 10.1661703705992,
11.4840412725324, 11.1248456370953, 11.2529612597628, 10.7694642397996,
12.3300887767583, 12.0478558531771, 12.3212362249214, 11.5650773932264,
9.56070414783612, 9.61762902218185, 10.2076240621201, 11.8234628013552,
10.9184029778985)), .Names = c("ID", "Y", "P", "A", "TA", "TS",
"R"), na.action = structure(77:78, .Names = c("77", "78"), class = "omit"), row.names = c(NA,
22L), class = "data.frame")
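I want to run a random forest on this dataset using leave-one-ID-out cross-validation, so the cross-validation folds should not be random. In each run I want to leave out all rows that share one ID value, because observations with the same ID are not independent. In other words, rows with the same ID must get the same cross-validation index. For example, the first run would train on the data with ID = 5, 6, 8, 9 and test on the data with ID = 4; the second run would train on the data with ID = 4, 6, 8, 9 and test on the data with ID = 5; and so on. Does anyone know how to implement this in R? Below is what I tried, but I am not sure it is conceptually correct.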
# Create Training dataset
df<-na.omit(df)
tvec<-unique(df$ID)
nruns <- length(tvec)
crossclass<-sample(nruns,length(tvec),TRUE)
nobs<-nrow(df)
crossPredict<-rep(NA,nobs)
#Run a RandomForest with leave one out ID CV
for (i in 1:nruns) {
indtrain<-which(df$ID %in% tvec[!crossclass==i])
indvalidate<-setdiff(1:nobs,indtrain)
rf<-randomForest(formula = Y ~ P + TA + TS + R + A, data=df, subset=indtrain,ntree=10000)
crossPredict[indvalidate]<-predict(rf,df[indvalidate,])
}
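I think this is what you are asking for:
You need as many cross-validation runs as there are distinct IDs in the training set. Your sample() call draws the run numbers at random with replacement, so two IDs can land in the same run and some runs can end up empty. Instead, we collect the IDs in the vector uniqueIDs and then assign each training observation to its run in the vector crossclass, like this: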
uniqueIDs <- unique(train$ID)
nruns <- length(uniqueIDs) # number of cross validation runs: one for each unique ID
crossclass <- match(train$ID, uniqueIDs)
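To see what match() produces here, the following small illustration uses the ID column from the question's full df (the answer itself applies the same call to train, a random subset of df). The name ids is introduced only for this sketch.

```r
# ID column copied from the question's dput() output
ids <- c(4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 6, 6, 6, 6, 8, 8, 8, 9, 9)

uniqueIDs  <- unique(ids)            # distinct IDs in order of first appearance
nruns      <- length(uniqueIDs)      # one cross-validation run per distinct ID
crossclass <- match(ids, uniqueIDs)  # position of each row's ID within uniqueIDs

# Each ID maps to exactly one run, so rows sharing an ID are always held out together
print(table(ID = ids, run = crossclass))
```

Because the run index is derived from the ID itself, nothing about the assignment is random.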
The main cross-validation loop stays essentially as you wrote it. (I only added some debugging output.)
library(randomForest)

nobs <- nrow(train)  # must equal length(crossclass), which was built from train
crossPredict <- rep(NA, nobs)
for (i in 1:nruns) {
  indtrain <- which(crossclass != i)        # rows whose ID is not held out in run i
  indvalidate <- setdiff(1:nobs, indtrain)  # rows with the held-out ID
  cat("Run", i, ": training only on observations with ID not", uniqueIDs[i], "\n")
  cat(" IDs in training set:", train[indtrain, "ID"], "\n")
  cat(" IDs in validation set:", train[indvalidate, "ID"], "\n")
  # Note: Y ~ . also uses ID as a predictor; write Y ~ P + TA + TS + R + A
  # (as in the question) to exclude it.
  rf_df_CV <- randomForest(Y ~ ., data = train[indtrain, ],
                           ntree = 1000, importance = TRUE, na.action = na.omit)
  crossPredict[indvalidate] <- predict(rf_df_CV, train[indvalidate, ])
}
Here is a sample output:
Run 1 : training only on observations with ID not 4
IDs in training set: 5 8 6 6 5 5 8 6
IDs in validation set: 4 4 4 4 4 4 4
Run 2 : training only on observations with ID not 5
IDs in training set: 4 4 8 4 6 6 4 4 4 4 8 6
IDs in validation set: 5 5 5
Run 3 : training only on observations with ID not 8
IDs in training set: 4 4 5 4 6 6 5 5 4 4 4 4 6
IDs in validation set: 8 8
Run 4 : training only on observations with ID not 6
IDs in training set: 4 4 5 8 4 5 5 4 4 4 4 8
IDs in validation set: 6 6 6
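Once the loop has finished, crossPredict holds one out-of-fold prediction per training row, and it can be compared with the observed Y values. The sketch below is one possible way to summarise that, not part of the original answer; the toy vectors observed and id are made-up stand-ins for train$Y and train$ID so that the snippet runs on its own.

```r
# Made-up stand-ins; in practice use train$Y, train$ID and the
# crossPredict vector filled by the cross-validation loop.
observed     <- c(10, 12, 11, 20, 22, 30)
crossPredict <- c(11, 11, 12, 18, 23, 27)
id           <- c(4, 4, 4, 5, 5, 6)

# Overall root-mean-square error of the out-of-fold predictions
rmse <- sqrt(mean((observed - crossPredict)^2))

# RMSE per held-out ID, to see which groups are predicted worst
rmse_by_id <- tapply((observed - crossPredict)^2, id, function(e) sqrt(mean(e)))

print(rmse)
print(rmse_by_id)
```

Looking at the error per ID is useful with grouped data, since a single poorly predicted ID can dominate the overall figure.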
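Note that in the sample output above, df had been randomly split into a training set and a test set, and the training set happened to contain no observations with ID = 9. Hence there is no CV run for that ID.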
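If every ID should get its own run, one option is to build the folds from the complete data rather than from a random subset. The following is a minimal sketch under stated assumptions: the data frame toy and its column x are made up for illustration, and lm() is used as a stand-in model so the snippet needs no extra packages. A randomForest() call on the training rows would take its place in practice.

```r
# Small made-up data frame for illustration (not the question's data)
set.seed(1)
toy <- data.frame(
  ID = rep(c(4, 5, 6, 8, 9), times = c(4, 3, 3, 2, 2)),
  x  = rnorm(14)
)
toy$Y <- 2 * toy$x + rnorm(14, sd = 0.1)

# One fold per ID: each list element holds the row indices of one ID
folds <- split(seq_len(nrow(toy)), toy$ID)

crossPredict <- rep(NA_real_, nrow(toy))
for (id in names(folds)) {
  indvalidate <- folds[[id]]                                # rows of the held-out ID
  indtrain    <- setdiff(seq_len(nrow(toy)), indvalidate)   # all other rows
  # lm() is a stand-in; replace with randomForest() on toy[indtrain, ] if desired
  fit <- lm(Y ~ x, data = toy[indtrain, ])
  crossPredict[indvalidate] <- predict(fit, newdata = toy[indvalidate, ])
}

print(length(folds))        # number of runs equals the number of distinct IDs
print(anyNA(crossPredict))  # FALSE when every row received a prediction
```

split() keys the folds by ID, so the number of runs always matches the number of distinct IDs present in the data.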