Issue with prediction values from a random uniform forest object
I have a data frame df:
df<-structure(list(ID = structure(c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L,
3L, 3L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 8L, 8L), .Label = c("AU-Tum",
"AU-Wac", "BE-Bra", "BE-Jal", "BR-Cax", "BR-Sa3", "CA-Ca1", "CA-Ca2",
"CA-Ca3", "CA-Gro", "Ca-Man", "CA-NS1", "CA-NS2", "CA-NS3", "CA-NS4",
"CA-NS5", "CA-NS6", "CA-NS7", "CA-Oas", "CA-Obs", "CA-Ojp", "CA-Qcu",
"CA-Qfo", "CA-SF1", "CA-SF2", "CA-SF3", "CA-SJ1", "CA-SJ2", "CA-SJ3",
"CA-TP1", "CA-TP2", "CA-TP4", "CN-Cha", "CN-Ku1", "CZ-Bk1", "De-Bay",
"DE-Har", "DE-Tha", "DE-Wet", "DK-Sor", "FI-Hyy", "FI-Sod", "FR-Hes",
"FR-Pue", "GF-Guy", "ID-Pag", "IL-Yat", "IT-Col", "IT-Lav", "IT-Non",
"IT-Ro1", "IT-Ro2", "IT-Sro", "JP-Tak", "JP-Tef", "JP-Tom", "NL-Loo",
"PT-Esp", "RU-Fyo", "SE-Abi", "SE-Fla", "SE-Nor", "SE-Sk1", "SE-Sk2",
"SE-St1", "UK-Gri", "UK-Ham", "US-Blo", "US-Bn1", "US-Bn2", "Us-Bn3",
"US-Dk3", "US-Fmf", "US-Fwf", "US-Ha1", "US-Ha2", "US-Ho1", "US-Ho2",
"US-Lph", "US-Me1", "US-Me3", "US-Nc2", "US-NR1", "US-Oho", "US-Sp1",
"US-Sp2", "US-Sp3", "US-Syv", "US-Umb", "US-Wcr", "US-Wi0", "US-Wi1",
"US-Wi2", "US-Wi4", "US-Wi8", "VU-Coc", "Austin", "Caxiuana",
"Mae Klong", "Niwot Ridge", "Sky Oaks old", "Sky Oaks young",
"Sodankylä", "Tomakomai", "Yenisey Abies", "Yenisey Betula",
"Yenisey Mixed"), class = "factor"), P = c(1241.59999960661,
1282.40000277758, 0, 895, 0, 960.399999260902, 988.300011262298,
778.211069688201, 0, 676.725008800626, 1750.51986303926, 1614.11541634798,
951.847023338079, 1119.3682884872, 1112.38984390156, 1270.65773075982,
1234.72262170166, 1338.46096616983, 1136.69287367165, 1265.46480803983
), Te = c(9.20406423444821, 9.58323018294185, NaN, 12.1362462834136,
NaN, 10.6474634506736, 10.2948508957069, 11.3363996068107, NaN,
11.9457507949986, 9.10006221322572, 7.65505467142961, 8.21480062559674,
8.09251754304318, 8.466220758789, 8.48094407814006, 8.77304120569444,
8.31727518543397, 8.80921738865237, 9.04091478341757), Y = c(2172.34112930298,
2479.44521586597, 1027.63470042497, 2342.35314202309, 868.4010617733,
1157.13594430499, 1118.60130960867, 1100.47051284742, 1072.57190890331,
1228.25697739795, 2268.14043972082, 2147.62290922552, 2269.1387550775,
2247.31983098201, 1903.39138268307, 2174.78291538358, 2359.51909126411,
2488.39004804939, 461.398994384333, 567.150629704352)), .Names = c("ID",
"P", "Te", "Y"), row.names = 307:326, class = "data.frame")
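I am trying to set up a leave-one-ID-out cross-validation for a random uniform forest analysis. Basically, I want a loop in which, on each iteration, I drop the rows sharing one ID and train my model on the remaining IDs. I then test the model by predicting on the rows that were held out from training. I have followed the suggestions in this post (https://stats.stackexchange.com/questions/109340/leave-one-subject-out-cross-validation-in-caret). As a result, I ended up with something like this: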
library(randomUniformForest)
df <- na.omit(df)
subs <- unique(df$ID)
model_these <- vector(mode = "list", length = length(subs))
test_these <- vector(mode = "list", length = length(subs))
for(i in seq_along(subs)){
  model_these[[i]] <- which(df$ID != subs[i])
  names(model_these) <- paste0("ID", subs)
  test_these[[i]] <- which(df$ID == subs[i])
  names(test_these) <- paste0("ID", subs)
  svmFit <- randomUniformForest(Y ~ P + Te,
                                data = df,
                                importance = T,
                                ntree = 100,
                                trControl = trainControl(method = "cv",
                                                         index = model_these,
                                                         classProbs = TRUE))
  ruf_pred = predict(object = svmFit, X = df, index = test_these)
}
The loop seems to work, but the predicted values are all identical, which obviously does not make sense.
[1] 1940.212 1940.212 1940.212 1940.212 1940.212 1940.212 1940.212 1940.212 1940.212 1940.212 1940.212 1940.212 1940.212 1940.212 1940.212
[16] 1940.212 1940.212
Does anyone know what I am doing wrong?
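It looks like you are trying to use arguments that are specific to the train method of the caret package, which is what the example you linked refers to. For instance, randomUniformForest does not accept a trControl argument, so the model in your loop is fitted on the full data frame every time and no rows are actually held out.
For your case, the following lines should work: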
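library(randomUniformForest)

# Drop incomplete rows first, as in the question
df <- na.omit(df)
df$prediction <- NA_real_

for (id in unique(df$ID)) {
  # Train on every other ID, test on the held-out one
  train.df <- df[df$ID != id, ]
  test.df  <- df[df$ID == id, c("P", "Te")]
  svmFit <- randomUniformForest(Y ~ P + Te,
                                data = train.df,
                                importance = TRUE,
                                ntree = 100)
  ruf_pred <- predict(object = svmFit, X = test.df)
  # Store the out-of-sample predictions next to the observations
  df$prediction[df$ID == id] <- ruf_pred
}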
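To check that each fold really holds out exactly one ID, the index sets can be built and inspected on their own before any model is fitted. The following is only a sketch in base R on a small made-up data frame (toy and its values are hypothetical, not taken from the question). Note the droplevels() call: a factor with unused levels, like ID in the question, would otherwise produce empty folds.

```r
# Toy data: hypothetical values, only the grouping structure matters here.
toy <- data.frame(
  ID = factor(c("a", "a", "b", "b", "b", "c")),
  P  = c(10, 12, 7, 8, 9, 15),
  Y  = c(100, 110, 70, 85, 90, 150)
)

# One list element per ID: the row indices held out in that fold.
test_idx <- split(seq_len(nrow(toy)), droplevels(toy$ID))

# The complementary rows are the training set of that fold.
train_idx <- lapply(test_idx, function(idx) setdiff(seq_len(nrow(toy)), idx))

# Sanity checks: no overlap within a fold, and every row is tested exactly once.
overlap <- mapply(function(tr, te) length(intersect(tr, te)), train_idx, test_idx)
stopifnot(all(overlap == 0))
stopifnot(identical(sort(unlist(test_idx, use.names = FALSE)), seq_len(nrow(toy))))
```

If both checks pass, train_idx[[k]] and test_idx[[k]] can be used to subset the data frame inside the loop in place of the logical comparisons on ID.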
Note, however, that you are using very few observations (randomUniformForest issues a warning about this), so the predictions are very inaccurate.
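One way to put a number on how inaccurate the held-out predictions are is to compute their RMSE and compare it with a naive baseline that always predicts the mean of the training rows. The sketch below is an illustration only: it uses a made-up data frame (toy, hypothetical values) and lm as a stand-in model so that it runs without randomUniformForest. The same bookkeeping should carry over to the forest fit.

```r
# Toy data: hypothetical values standing in for the question's df.
toy <- data.frame(
  ID = factor(rep(c("a", "b", "c", "d"), each = 3)),
  P  = c(10, 12, 11, 7, 8, 9, 15, 14, 16, 5, 6, 4),
  Y  = c(100, 118, 112, 71, 83, 88, 149, 141, 163, 52, 58, 41)
)

toy$prediction <- NA_real_
toy$baseline   <- NA_real_

for (id in levels(toy$ID)) {
  held_out <- toy$ID == id
  # Stand-in model (assumption): swap in randomUniformForest here.
  fit <- lm(Y ~ P, data = toy[!held_out, ])
  toy$prediction[held_out] <- predict(fit, newdata = toy[held_out, ])
  # Naive baseline: predict the mean response of the training rows.
  toy$baseline[held_out] <- mean(toy$Y[!held_out])
}

# Root mean squared error over all held-out predictions.
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))
rmse_model    <- rmse(toy$Y, toy$prediction)
rmse_baseline <- rmse(toy$Y, toy$baseline)
```

A model whose RMSE is not clearly below the baseline is not extracting much from the predictors, which is a likely outcome with as few rows as in the question.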