Error with "downSample" in the caret package in R, and which function is best?
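I am trying two sampling methods for imbalanced data.
I used the "upSample" function from the "Caret" package and everything went fine.
However, when I use the "downSample" function I get the following error: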
Error in sample.int(length(x), size, replace, prob) :
cannot take a sample larger than the population when 'replace = FALSE'
The command I used is:
downtrain_eli=downSample(x=trainset_eli[,-16],
y=trainset_eli$Comportamento)
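"trainset_eli" has 34 columns and 70,800 rows.
Since I am using a random forest model to predict a multiclass (6 classes) response variable, I am testing these two functions (upSample and downSample) to keep my data balanced. However, I saw that the "Caret" package also contains the "train" function, which has more options for balancing data. But "train" is a model-fitting function, and I only want something that creates a dataset with balanced classes, which I would then use in my random forest model. Is it better for me to keep using the up/down functions, or to use the "train" function? If the latter, how can I implement it in my random forest model?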
str(trainset_eli)
$ date : chr "01/10/2019" "24/09/2019" "01/10/2019" "01/10/2019" ...
$ air.temp : num 18.4 32.6 34.5 26.4 32.6 ...
$ relat.u : num 70 30.4 22.2 50.7 30.8 ...
$ wind.sp : num 1.14 2.81 1.51 3.33 2.17 ...
$ wind.dir : num 79.1 341.6 350.1 56.2 294.9 ...
$ solar.rad : num 39.6 741 433.9 621.1 274.6 ...
$ max.raj : num 1.65 5.25 2.85 6.05 4.45 ...
$ time : chr "06:40:00" "14:10:00" "14:40:00" "09:20:00" ...
$ timedate : POSIXct, format: "2019-10-01 06:43:48" "2019-09-24 14:10:45" "2019-10-01 14:48:50" ...
$ sensorid : int 67 65 66 70 70 70 69 68 69 65 ...
$ x : int -56 -49 15 35 -4 27 -40 33 -29 -47 ...
$ y : int -11 0 -4 24 10 34 -43 4 -4 5 ...
$ z : int -27 -37 -56 -20 -16 -44 -51 -49 -53 -41 ...
$ i.date : chr "01/10/2019" "24/09/2019" "01/10/2019" "01/10/2019" ...
$ i.time : chr "06:43:48" "14:10:45" "14:48:50" "09:21:41" ...
$ Comportamento: Factor w/ 6 levels "1","2","4","5",..: 6 3 3 5 2 2 1 1 2 1 ...
$ xg : num -0.875 -0.7656 0.2344 0.5469 -0.0625 ...
$ yg : num -0.1719 0 -0.0625 0.375 0.1562 ...
$ zg : num -0.422 -0.578 -0.875 -0.312 -0.25 ...
$ SMA : num 1.469 1.344 1.172 1.234 0.469 ...
$ SVM : num 0.986 0.959 0.908 0.733 0.301 ...
$ mov.var : num 0.0625 0.1094 0.0469 1.0156 1 ...
$ energy : num 0.94701 0.84715 0.67974 0.28875 0.00825 ...
$ entropy : num 0.2526 0.1219 0.0354 0.8179 0.0172 ...
$ pitch : num 62.5 52.9 -15 -48.2 12 ...
$ roll : num -158 180 -176 130 148 ...
$ inclination : num -64.7 -52.9 -15.5 -64.8 -33.9 ...
$ year : num 2019 2019 2019 2019 2019 ...
$ month : num 10 9 10 10 9 10 10 10 10 10 ...
$ day : int 1 24 1 1 24 1 1 1 1 1 ...
$ dayofweek : num 3 3 3 3 3 3 3 3 3 3 ...
$ hour : int 6 14 14 9 16 13 6 16 7 6 ...
$ minute : int 43 10 48 21 38 35 43 48 20 36 ...
$ second : num 48 45 50 41 45 16 36 13 43 57 ...
> dput(head(trainset_eli))
structure(list(date = c("01/10/2019", "24/09/2019", "01/10/2019",
"01/10/2019", "24/09/2019", "01/10/2019"), air.temp = c(18.42,
32.63, 34.54, 26.42, 32.63, 34.44), relat.u = c(70, 30.45, 22.19,
50.69, 30.83, 25.67), wind.sp = c(1.136, 2.809, 1.512, 3.326,
2.171, 2.04), wind.dir = c(79.1, 341.6, 350.1, 56.22, 294.9,
16.57), solar.rad = c(39.62, 741, 433.9, 621.1, 274.6, 847),
max.raj = c(1.647, 5.247, 2.847, 6.047, 4.447, 4.447), time = c("06:40:00",
"14:10:00", "14:40:00", "09:20:00", "16:30:00", "13:30:00"
), timedate = structure(c(1569912228, 1569334245, 1569941330,
1569921701, 1569343125, 1569936916), class = c("POSIXct",
"POSIXt"), tzone = "UTC"), sensorid = c(67L, 65L, 66L, 70L,
70L, 70L), x = c(-56L, -49L, 15L, 35L, -4L, 27L), y = c(-11L,
0L, -4L, 24L, 10L, 34L), z = c(-27L, -37L, -56L, -20L, -16L,
-44L), i.date = c("01/10/2019", "24/09/2019", "01/10/2019",
"01/10/2019", "24/09/2019", "01/10/2019"), i.time = c("06:43:48",
"14:10:45", "14:48:50", "09:21:41", "16:38:45", "13:35:16"
), Comportamento = structure(c(6L, 3L, 3L, 5L, 2L, 2L), .Label = c("1",
"2", "4", "5", "6", "7"), class = "factor"), xg = c(-0.875,
-0.765625, 0.234375, 0.546875, -0.0625, 0.421875), yg = c(-0.171875,
0, -0.0625, 0.375, 0.15625, 0.53125), zg = c(-0.421875, -0.578125,
-0.875, -0.3125, -0.25, -0.6875), SMA = c(1.46875, 1.34375,
1.171875, 1.234375, 0.46875, 1.640625), SVM = c(0.986480882354037,
0.959380089563047, 0.907999389110477, 0.733044006608744,
0.30136408628103, 0.965847466282849), mov.var = c(0.0625,
0.109375, 0.046875, 1.015625, 1, 0.078125), energy = c(0.947010278701782,
0.847154855728149, 0.679739058017731, 0.288748800754547,
0.00824832916259766, 0.870230257511139), entropy = c(0.252618304422212,
0.121902803377891, 0.0354050216019417, 0.817915633557388,
0.0171719387098626, 0.109209155417093), pitch = c(62.4975813343597,
52.9434718105904, -14.9586823290351, -48.247900416119, 11.9694631246073,
-25.8994130495892), roll = c(-157.833654177918, 180, -175.914383220025,
129.805571092265, 147.994616791916, 142.305759533311), inclination = c(-64.6810700998259,
-52.9434718105904, -15.4942996397858, -64.7667344528855,
-33.9462950277539, -44.6176169165428), year = c(2019, 2019,
2019, 2019, 2019, 2019), month = c(10, 9, 10, 10, 9, 10),
day = c(1L, 24L, 1L, 1L, 24L, 1L), dayofweek = c(3, 3, 3,
3, 3, 3), hour = c(6L, 14L, 14L, 9L, 16L, 13L), minute = c(43L,
10L, 48L, 21L, 38L, 35L), second = c(48, 45, 50, 41, 45,
16)), row.names = c(NA, -6L), .internal.selfref = <pointer: 0x56139e8dcfc0>, class = c("data.table",
"data.frame"))
I'm not quite sure why it isn't working. If I use an example dataset with imbalanced classes, where the label column is named class:
library(caret)
library(data.table)
dt = data.frame(v1 = runif(100), v2 = rnorm(100),class = sample(factor(1:6),100,seq(0.1,0.6,by=0.1),replace=TRUE))
dt = data.table(dt)
We check the class counts in the output (caret names the outcome column Class by default):
table(downSample(dt[,-3],dt$class)$Class)
1 2 3 4 5 6
4 4 4 4 4 4
table(upSample(dt[,-3],dt$class)$Class)
1 2 3 4 5 6
27 27 27 27 27 27
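One difference between this toy data and "trainset_eli" may be worth ruling out. The question's data has predictor columns named x and y, which are also the argument names of downSample. If the grouping inside downSample ends up using the data's own y column instead of the class labels, some groups would hold fewer rows than the smallest class, and that would produce exactly the sample.int error shown. This is a guess that has not been confirmed in this thread. The sketch below, using made-up data and a hypothetical helper rename_clashing (the "acc_" prefix is arbitrary), renames such columns before down-sampling:

```r
# Toy stand-in for trainset_eli: predictors named x, y, z plus a 6-level factor.
# All values here are simulated for illustration only.
set.seed(1)
toy <- data.frame(
  x = sample(-60:60, 300, replace = TRUE),
  y = sample(-60:60, 300, replace = TRUE),
  z = sample(-60:60, 300, replace = TRUE),
  Comportamento = factor(sample(c(1, 2, 4, 5, 6, 7), 300, replace = TRUE,
                                prob = c(0.05, 0.10, 0.15, 0.20, 0.20, 0.30)))
)

# Hypothetical helper: rename predictors whose names clash with
# downSample()'s own argument names ("x" and "y").
rename_clashing <- function(df, clashing = c("x", "y"), prefix = "acc_") {
  hit <- names(df) %in% clashing
  names(df)[hit] <- paste0(prefix, names(df)[hit])
  df
}

# Drop the outcome column, then rename the clashing predictors.
predictors <- rename_clashing(toy[, names(toy) != "Comportamento"])
print(names(predictors))

# Only run the caret call if the package is installed.
if (requireNamespace("caret", quietly = TRUE)) {
  down <- caret::downSample(x = predictors, y = toy$Comportamento)
  print(table(down$Class))  # every class should have the same count
}
```

If renaming the columns makes the error disappear on the real data, the name clash was the cause; if not, the cause lies elsewhere.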
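We could write a function to do it ourselves, but I'm really not sure why caret isn't working for you:
# size of the smallest class
n = min(table(dt$class))
# for each class, draw n of its row indices without replacement
# (indexing via sample.int avoids sample() treating a single index k as 1:k)
idx = unlist(tapply(seq_len(nrow(dt)),dt$class,function(i) i[sample.int(length(i),n)]))
dt[idx,]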
v1 v2 class
1: 0.24056931 0.98202652 1
2: 0.29899859 0.69350666 1
3: 0.05496686 1.32054392 1
4: 0.62017288 1.49824766 1
5: 0.67481604 0.45320585 2
6: 0.79654281 0.49854685 2
7: 0.74180115 0.87424714 2
8: 0.02848226 -0.74332299 2
9: 0.05007267 1.18599816 3
10: 0.94377121 -0.45921234 3
11: 0.63222065 0.77273476 3
12: 0.89684199 -0.74368572 3
13: 0.19782915 -0.62413381 4
14: 0.89286833 0.08664853 4
15: 0.48428538 -0.90199352 4
16: 0.08179512 1.51315151 4
17: 0.89740177 -2.28249763 5
18: 0.35267634 -0.54414029 5
19: 0.68710533 -1.99195471 5
20: 0.76743271 1.17255792 5
21: 0.80106456 0.21315622 6
22: 0.53640778 0.56632657 6
23: 0.38322745 0.74336152 6
24: 0.36704649 -0.43914106 6
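On the second part of the question (up/down functions versus "train"): caret's trainControl accepts a sampling argument, so the down-sampling can happen inside each resampling iteration rather than once up front. The sketch below is a minimal illustration on simulated data, not the asker's actual setup; the column names, class labels, fold count, mtry and ntree values are all assumptions chosen to keep it small, and it needs the randomForest package installed for method = "rf".

```r
library(caret)

# Simulated imbalanced data with 6 classes (illustrative only).
set.seed(42)
toy <- data.frame(
  v1 = runif(300),
  v2 = rnorm(300),
  class = factor(sample(paste0("c", 1:6), 300, replace = TRUE,
                        prob = seq(0.1, 0.6, by = 0.1)))
)

# 5-fold cross-validation; within each fold the training part is
# down-sampled to the size of its smallest class before fitting.
ctrl <- trainControl(method = "cv", number = 5, sampling = "down")

# Random forest fitted through train(); ntree is passed on to randomForest.
fit <- train(class ~ ., data = toy, method = "rf",
             trControl = ctrl,
             tuneGrid = data.frame(mtry = 1),
             ntree = 100)
print(fit)
```

With this approach the held-out folds keep their original class proportions, so the reported accuracy reflects the imbalanced data the model will actually see. Replacing "down" with "up" gives the up-sampled variant. Calling upSample or downSample once beforehand is still fine if all that is needed is a balanced data set to pass to a separately fitted random forest.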