Stratified sampling using createDataPartition drops small classes out of test

Question

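I am trying to do stratified sampling, and I realized that when some classes have very few cases, I can end up with a test data set that contains none of those minority classes.

Here is some example code: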

library(caret)
library(randomForest) # provides imports85, rfImpute() and randomForest()

# data set for debugging in RStudio
data("imports85")
input<-imports85
    
# settings
set.seed(1)
dependent <- make.names("make")
training.share <- 0.75
impute <- "no"
type <- "classification"

# save original column names for later and make R-friendly column names
original.names <- names(input)
names(input) <- make.names(original.names)
    
# create train and test data sets
input.labelled <- input[complete.cases(input[,dependent]),] #split off rows w/o dependent
if (impute=="no") { 
    input.clean <- input.labelled[complete.cases(input.labelled),] #drop cases w/ missing variables
} else if (impute=="yes") {
    input.clean <- rfImpute(input.labelled[,dependent] ~ .,input.labelled)[,-1] #or impute missing variables and remove added duplicate of dependent column
}

train.index <- createDataPartition(input.clean[,dependent], p=training.share, list=FALSE) #create row index for train data set using stratified sampling but very small classes might all go into train?!
rf.train <- input.clean[train.index,] #create train data set
rf.test <- input.clean[-train.index,] #create test data set from left-overs
if (type=="classification") { #balance train data set for classification (can be skipped if upsampling takes place as part of tuning settings cntrl)
    rf.train <- upSample(x=rf.train[, names(rf.train) != dependent], y=rf.train[, names(rf.train) == dependent], yname=dependent)
}

# define response Y and predictors x
Y.train <- rf.train[, names(rf.train) == dependent]
x.train <- rf.train[, names(rf.train) != dependent]
Y.test <- rf.test[, names(rf.test) == dependent]
x.test <- rf.test[, names(rf.test) != dependent]

# train single RF model
rf <- randomForest(x.train, y=Y.train, xtest=x.test, ytest=Y.test, type=type, keep.forest=TRUE)

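You will get a warning from createDataPartition, and you will see that, for example, make == "chevrolet" has 3 cases in rf.train and none in rf.test, which can cause problems downstream in randomForest.

Is there a clever way to avoid this without leaking data from train into test?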

Answer 1

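A lot of this is the same, but not all of it.

The same:

This is because of your dependent variable. You chose make. Did you inspect this field? You have training and testing; where do you put an outcome that has only one observation, like make = "mercury"? How can you train with that? And if you don't train for it, how can you test for it?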

library(dplyr) # provides %>%, group_by(), summarise(), arrange(), filter(), select()

input %>% 
  group_by(make) %>% 
  summarise(count = n()) %>% 
  arrange(count) %>% 
  print(n = 22)

# # A tibble: 22 × 2
#    make        count
#    <fct>       <int>
#  1 mercury         1
#  2 renault         2
#  3 alfa-romero     3
#  4 chevrolet       3
#  5 jaguar          3
#  6 isuzu           4
#  7 porsche         5
#  8 saab            6
#  9 audi            7
# 10 plymouth        7
# 11 bmw             8
# 12 mercedes-benz   8
# 13 dodge           9
# 14 peugot         11
# 15 volvo          11
# 16 subaru         12
# 17 volkswagen     12
# 18 honda          13
# 19 mitsubishi     13
# 20 mazda          17
# 21 nissan         18
# 22 toyota         32

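When you ran createDataPartition(), you also got warnings. I think the randomForest package requires at least five observations per group. You can filter for the groups you want to include and use that data for both testing and training.

Before the comment labeled settings, you can add the following to subset the groups and validate the result.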

filtGrps <- input %>% 
  group_by(make) %>% 
  summarise(count = n()) %>% 
  filter(count >=5) %>% 
  select(make) %>% 
  unlist()

# filter for groups with sufficient observations for package
input <- input %>% 
  filter(make %in% filtGrps) %>% 
  droplevels() # then drop the empty levels

# check to see if it filtered as expected
input %>% 
  group_by(make) %>% 
  summarise(count = n()) %>% 
  arrange(-count) %>% 
  print(n = 16)

This uses a minimum of only 5 per class, which is not ideal. (More is better.)
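
To see why very small classes never reach the test set, here is a rough sketch in base R. It assumes the per-class training count is ceiling(n * p); that is consistent with the chevrolet case in the question (3 rows, all in train), but it is an assumption about caret's internals, not something confirmed here. The helper name expected_split is hypothetical.

```r
# Hypothetical helper: expected train/test sizes for a class of size n,
# ASSUMING the train set receives ceiling(n * p) rows of that class.
expected_split <- function(n, p = 0.75) {
  train <- ceiling(n * p)
  data.frame(n = n, train = train, test = n - train)
}

# Class sizes 1 through 6 at the 75/25 split used in the question
split_sizes <- expected_split(1:6)
print(split_sizes)
```

Under that assumption, a class with 3 or fewer rows leaves nothing for the test set at p = 0.75, and a class of 5 leaves a single test row, which is why a threshold of 5 is only a bare minimum.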

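What changes:

In the caret model, you used imputation. You did not do that for this model. When you created input.clean, you removed another 34 observations. At that point...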

# you removed another 34 rows- need to check the classes, again
# you imputed for caret/train
input.clean %>% 
  group_by(make) %>% 
  summarise(count = n()) %>% 
  arrange(-count) %>% 
  print(n = 16)
# # A tibble: 16 × 2
#    make          count
#    <fct>         <int>
#  1 toyota           31
#  2 nissan           18
#  3 honda            13
#  4 subaru           12
#  5 mazda            11
#  6 volvo            11
#  7 mitsubishi       10
#  8 dodge             8
#  9 volkswagen        8
# 10 peugot            7
# 11 plymouth          6
# 12 saab              6
# 13 mercedes-benz     5
# 14 audi              4
# 15 bmw               4
# 16 porsche           1

You now need to drop three more classes.

# there is an exclamation point to negate this
input.clean <- input.clean %>% 
  filter(!make %in% c("audi", "bmw", "porsche")) %>% 
  droplevels()

# validate changes
input.clean %>% 
  group_by(make) %>% 
  summarise(count = n()) %>% 
  arrange(-count) %>% 
  print(n = 16)
# 13 classes now
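
Instead of hard-coding the class names as above, the same filter could be driven by a count threshold. This is a minimal base R sketch; the helper name drop_small_classes and the toy data frame are hypothetical and only stand in for input.clean.

```r
# Hypothetical helper: keep only the classes of column `col`
# that have at least `min_n` rows, then drop the unused factor levels.
drop_small_classes <- function(df, col, min_n = 5) {
  counts <- table(df[[col]])
  keep <- names(counts)[counts >= min_n]
  droplevels(df[df[[col]] %in% keep, , drop = FALSE])
}

# Toy data standing in for input.clean (values are made up for illustration)
toy <- data.frame(
  make  = factor(c(rep("toyota", 6), rep("audi", 4), "porsche")),
  price = seq_len(11)
)

toy.clean <- drop_small_classes(toy, "make", min_n = 5)
print(table(toy.clean$make))
```

Applied to the real data, the call would look like drop_small_classes(input.clean, "make", 5), run after the complete.cases() step so that the counts reflect the rows that actually remain.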

From here, your code is good to go.

rf
# 
# Call:
#  randomForest(x = x.train, y = Y.train, xtest = x.test, ytest = Y.test,      keep.forest = TRUE, type = type) 
#                Type of random forest: classification
#                      Number of trees: 500
# No. of variables tried at each split: 5
# 
#         OOB estimate of  error rate: 1.92%
# Confusion matrix:
#               dodge honda mazda mercedes-benz mitsubishi nissan peugot
# dodge            24     0     0             0          0      0      0
# honda             0    22     0             0          2      0      0
# mazda             0     0    24             0          0      0      0
# mercedes-benz     0     0     0            24          0      0      0
# mitsubishi        0     0     0             0         23      0      0
# nissan            0     0     0             0          0     23      0
# peugot            0     0     0             0          0      0     24
# plymouth          0     0     0             0          0      0      0
# saab              0     0     0             0          0      0      0
# subaru            0     0     0             0          0      0      0
# toyota            0     0     0             0          0      1      0
# volkswagen        0     0     0             0          0      0      0
# volvo             0     0     0             0          0      0      0
#               plymouth saab subaru toyota volkswagen volvo class.error
# dodge                0    0      0      0          0     0  0.00000000
# honda                0    0      0      0          0     0  0.08333333
# mazda                0    0      0      0          0     0  0.00000000
# mercedes-benz        0    0      0      0          0     0  0.00000000
# mitsubishi           1    0      0      0          0     0  0.04166667
# nissan               0    0      0      1          0     0  0.04166667
# peugot               0    0      0      0          0     0  0.00000000
# plymouth            24    0      0      0          0     0  0.00000000
# saab                 0   24      0      0          0     0  0.00000000
# subaru               0    0     24      0          0     0  0.00000000
# toyota               0    0      0     22          0     1  0.08333333
# volkswagen           0    0      0      0         24     0  0.00000000
# volvo                0    0      0      0          0    24  0.00000000
#                 Test set error rate: 3.23%
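
A quick sanity check after splitting can confirm that no trained class is absent from the test set. The sketch below uses base R only; the helper name classes_missing_from_test and the toy labels are hypothetical.

```r
# Hypothetical helper: classes that occur in the training labels
# but have no rows at all in the test labels.
classes_missing_from_test <- function(train_y, test_y) {
  setdiff(unique(as.character(train_y)), unique(as.character(test_y)))
}

# Toy labels (made up for illustration)
train_y <- factor(c("toyota", "toyota", "chevrolet", "honda"))
test_y  <- factor(c("toyota", "honda"), levels = levels(train_y))

absent <- classes_missing_from_test(train_y, test_y)
print(absent)
```

With the objects from the question, the call would be classes_missing_from_test(Y.train, Y.test); an empty result means every class the model was trained on can also be evaluated.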

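Tip: if you make these calls in the same script file, use unique object names for each model so that you always know which object holds which data. Reusing names can be a hidden bug that causes all kinds of problems.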
