Keep the same ratios between groups in training and test datasets
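For a machine learning project, I want to split my data into training and test sets so that the proportions of a particular group stay the same in both sets. I created a dummy 40-row data.frame to illustrate. Here, for the "Region" group, 20% of the data is "North America", 50% is "Europe", 20% is "Asia", and 10% is "Oceania". I would like to end up with a random subset of, say, 25% of the whole data in which the percentage composition of "Region" is unchanged.
In other words, I want to start from this: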
City County Region
1 Shangai China Asia
2 Tokyo Japan Asia
3 Osaka Japan Asia
4 Hanoi Vietnam Asia
5 Beijing China Asia
6 Sapporo Japan Asia
7 Tottori Japan Asia
8 Saigon Vietnam Asia
9 Rome Italy Europe
10 Paris France Europe
11 Lisbon Portugal Europe
12 Berlin Germany Europe
13 Madrid Spain Europe
14 Vienna Austria Europe
15 Naples Italy Europe
16 Nice France Europe
17 Porto Portugal Europe
18 Frankfurt Germany Europe
19 Sevilla Spain Europe
20 Salzburg Austria Europe
21 Barcelona Spain Europe
22 Amsterdam Netherlands Europe
23 Bern Switzerland Europe
24 Milan Italy Europe
25 San Sebastian Spain Europe
26 Rotterdam Netherlands Europe
27 Zurich Switzerland Europe
28 Turin Italy Europe
29 New York City US North America
30 Toronto Canada North America
31 Mexico City Mexico North America
32 Atlanta US North America
33 Chicago US North America
34 Atlanta US North America
35 Vancouver Canada North America
36 Guadalajara Mexico North America
37 Sydney Australia Oceania
38 Wellington New Zealand Oceania
39 Melbourne Australia Oceania
40 Auckland New Zealand Oceania
and end up with this (it is important to me that the rows are picked at random):
City County Region
1 New York US North America
2 Mexico City Mexico North America
3 Amsterdam Netherlands Europe
4 Madrid Spain Europe
5 Lisbon Portugal Europe
6 Rome Italy Europe
7 Paris France Europe
8 Tokyo Japan Asia
9 Osaka Japan Asia
10 Wellington New Zealand Oceania
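The createDataPartition() function in the caret package can be used to assign observations to training and test groups while preserving the percentage distribution of each class of the split variable. We'll illustrate its use with the Alzheimer's disease data from Applied Predictive Modeling.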
library(caret)
library(AppliedPredictiveModeling)
set.seed(90125)
data(AlzheimerDisease)
adData = data.frame(diagnosis,predictors)
inTrain = createDataPartition(adData$diagnosis, p = .6)[[1]]
training = adData[ inTrain,]
testing = adData[-inTrain,]
We now tabulate the dependent variable in each data frame. The ratio of Impaired to Control is just under 0.38 in both.
> table(training$diagnosis)
Impaired Control
55 146
> table(testing$diagnosis)
Impaired Control
36 96
> 55/146
[1] 0.3767123
> 36/96
[1] 0.375
>
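Note that 55/146 and 36/96 are ratios of Impaired to Control; the share of Impaired rows among all rows in a data frame is a different number. A minimal base R sketch of that arithmetic, using only the counts printed above (the variable names are illustrative and not part of the original answer):

```r
# Class counts copied from the two tables above
train_counts <- c(Impaired = 55, Control = 146)
test_counts  <- c(Impaired = 36, Control = 96)

# Share of each class within each data frame (each vector sums to 1)
train_share <- prop.table(train_counts)
test_share  <- prop.table(test_counts)

round(train_share, 3)
round(test_share, 3)
```

Either measure shows the same thing: the class balance is nearly identical in the training and test data frames.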
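Using the data from the original post
If we take a 75% sample of the data provided in the question, we get a training data frame of 30 rows and a test data frame of 10 rows.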
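The 30/10 split can be checked by hand. The sketch below assumes each Region group is split in the same 75/25 proportion and uses the group sizes from the question; the variable names are illustrative. Every group size here is a multiple of 4, so the counts come out as whole numbers. With other group sizes some rounding is needed and the proportions would only be approximately preserved.

```r
# Region counts from the 40-row data frame in the question
region_counts <- c(Asia = 8, Europe = 20, "North America" = 8, Oceania = 4)

p <- 0.75  # fraction of each group sent to the training set

train_counts <- region_counts * p             # rows per group in training
test_counts  <- region_counts - train_counts  # rows per group in testing

sum(train_counts)               # 30
sum(test_counts)                # 10
test_counts / sum(test_counts)  # 0.2, 0.5, 0.2, 0.1
```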
# OP data
textFile <- "id|City|County|Region
1|Shangai|China|Asia
2|Tokyo|Japan|Asia
3|Osaka|Japan|Asia
4|Hanoi|Vietnam|Asia
5|Beijing|China|Asia
6|Sapporo|Japan|Asia
7|Tottori|Japan|Asia
8|Saigon|Vietnam|Asia
9|Rome|Italy|Europe
10|Paris|France|Europe
11|Lisbon|Portugal|Europe
12|Berlin|Germany|Europe
13|Madrid|Spain|Europe
14|Vienna|Austria|Europe
15|Naples|Italy|Europe
16|Nice|France|Europe
17|Porto|Portugal|Europe
18|Frankfurt|Germany|Europe
19|Sevilla|Spain|Europe
20|Salzbourg|Austria|Europe
21|Barcelona|Spain|Europe
22|Amsterdam|Netherlands|Europe
23|Bern|Switzerland|Europe
24|Milan|Italy|Europe
25|SanSebastian|Spain|Europe
26|Rotterdam|Netherlands|Europe
27|Zurich|Switzerland|Europe
28|Turin|Italy|Europe
29|New York City|US|North America
30|Toronto|Canada|North America
31|Mexico City|Mexico|North America
32|Atlanta|US|North America
33|Chicago|US|North America
34|Atlanta|US|North America
35|Vancouver|Canada|North America
36|Guadalajara|Mexico|North America
37|Syndey|Australia|Oceania
38|Wellington|New Zealand|Oceania
39|Melbourn|Australia|Oceania
40|Auckland|New Zealand|Oceania"
data <- read.table(text = textFile,header = TRUE,sep = "|",
stringsAsFactors = FALSE)
set.seed(901250)
inTrain = createDataPartition(data$Region, p = .75)[[1]]
training = data[ inTrain,]
testing = data[-inTrain,]
When we print a table of the test data, we see that Region is distributed as the question asked: 20% Asia, 50% Europe, 20% North America, and 10% Oceania.
> table(testing$Region)
         Asia        Europe North America       Oceania
            2             5             2             1
>
Finally, we print the testing data frame.
> testing
id City County Region
2 2 Tokyo Japan Asia
8 8 Saigon Vietnam Asia
9 9 Rome Italy Europe
17 17 Porto Portugal Europe
19 19 Sevilla Spain Europe
21 21 Barcelona Spain Europe
22 22 Amsterdam Netherlands Europe
32 32 Atlanta US North America
36 36 Guadalajara Mexico North America
38 38 Wellington New Zealand Oceania
>
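For comparison, the same kind of stratified split can be sketched in base R without caret. This is an alternative illustration, not the method used above: stratified_rows() is a hypothetical helper, the seed is arbitrary, and the data frame is a stand-in that reproduces only the id and Region columns of the question's data.

```r
# Stand-in for the 40-row data frame: only the id and Region columns
df <- data.frame(
  id = 1:40,
  Region = rep(c("Asia", "Europe", "North America", "Oceania"),
               times = c(8, 20, 8, 4)),
  stringsAsFactors = FALSE
)

set.seed(1)  # arbitrary seed, only so the draw is reproducible

# Hypothetical helper: pick a random fraction of the row indices in each group
stratified_rows <- function(groups, frac) {
  idx_by_group <- split(seq_along(groups), groups)
  picked <- lapply(idx_by_group, function(idx) {
    n <- round(length(idx) * frac)
    # sample positions rather than values, so one-row groups behave correctly
    idx[sample.int(length(idx), n)]
  })
  sort(unlist(picked, use.names = FALSE))
}

test_rows     <- stratified_rows(df$Region, 0.25)
testing_base  <- df[test_rows, ]
training_base <- df[-test_rows, ]

table(testing_base$Region)  # 2 Asia, 5 Europe, 2 North America, 1 Oceania
```

Which rows are drawn depends on the seed, but the per-region counts in testing_base are fixed by the group sizes and the 25% fraction.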