Why does broom::augment return more rows than the input data?

I noticed that when I use broom's augment function, the resulting data frame has more rows than the data I started with. For example:

# Statistical Modeling
## dummy vars
library(tidyverse)
library(caret) # provides dummyVars, trainControl, train and the summary functions
training_data <- mtcars
dummy <- caret::dummyVars(~ ., data = training_data, fullRank = TRUE, sep = ".")
training_data <- predict(dummy, mtcars) %>% as.data.frame()
clean_names <- names(training_data) %>% str_replace_all(" |`", "")
names(training_data) <- clean_names

## make target a factor
target <- training_data$mpg
target <- ifelse(target < 20, 0,1) %>% as.factor() %>% make.names()

## custom evaluation metric function
my_summary <- function(data, lev = NULL, model = NULL) {
  a1 <- defaultSummary(data, lev, model)
  b1 <- twoClassSummary(data, lev, model)
  c1 <- prSummary(data, lev, model)
  # combine all metrics into one named vector
  c(a1, b1, c1)
}

## tuning & parameters
set.seed(123)
train_control <- trainControl(
  method = "cv",
  number = 3,
  sampling = "up", # oversample because the classes are imbalanced
  savePredictions = TRUE,
  verboseIter = TRUE,
  classProbs = TRUE,
  summaryFunction = my_summary
)

linear_model <- train(
  x = select(training_data, -mpg),
  y = target,
  trControl = train_control,
  method = "glm", # logistic regression
  family = "binomial",
  metric = "AUC"
)

library(broom)
linear_augment <- augment(linear_model$finalModel)

Now, if I look at the new augmented data frame and compare it with the original mtcars data frame:

> nrow(mtcars)
[1] 32
> nrow(linear_augment)
[1] 36

I expected 32 rows, not 36. Why is that?

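You are upsampling in your trainControl call (sampling = "up"), so the final model is fit on more samples than the original data set contains.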
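The count of 36 can be checked by hand. In mtcars, 18 cars have mpg < 20 and 14 have mpg >= 20, so upsampling the minority class to match the majority gives 18 + 18 = 36 rows. The sketch below reproduces that arithmetic with caret::upSample, on the assumption that it mirrors what sampling = "up" does before the final fit; the object name `up` is only illustrative.

```r
library(caret)

# Same target definition as in the question ("X0" = mpg < 20, "X1" = mpg >= 20)
target <- factor(ifelse(mtcars$mpg < 20, "X0", "X1"))
table(target)        # 18 "X0" and 14 "X1"

# Resample the minority class with replacement until both classes have 18 rows
up <- upSample(x = mtcars[, names(mtcars) != "mpg"], y = target)
table(up$Class)      # 18 of each class
nrow(up)             # 36
```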

## tuning & parameters
set.seed(123)
train_control <- trainControl(
  method = "cv",
  number = 3,
  # sampling = "up", # oversample because the classes are imbalanced
  savePredictions = TRUE,
  verboseIter = TRUE,
  classProbs = TRUE,
  summaryFunction = my_summary
)

linear_model <- train(
  x = select(training_data, -mpg),
  y = target,
  trControl = train_control,
  method = "glm", # logistic regression
  family = "binomial",
  metric = "AUC"
)
library(broom)
linear_augment <- augment(linear_model$finalModel)

Note that the upsampling is commented out:

> dim(linear_augment)
[1] 32 19
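If you want to keep the upsampling and still get one row per original observation, one option is to predict on the original predictors instead of augmenting the final model. This is a minimal sketch, not part of the original answer: the two predictors (wt, hp) and the names `ctrl`, `fit`, `aug` and `probs` are chosen only for illustration, and glm may warn about fitted probabilities of 0 or 1 on data this small.

```r
library(caret)
library(broom)

set.seed(123)
target <- factor(ifelse(mtcars$mpg < 20, "X0", "X1"))
predictors <- mtcars[, c("wt", "hp")]   # illustrative subset of predictors

ctrl <- trainControl(method = "cv", number = 3,
                     sampling = "up", classProbs = TRUE)
fit <- train(x = predictors, y = target,
             method = "glm", family = "binomial", trControl = ctrl)

# augment() sees the data the final glm was fit on, i.e. the upsampled rows
aug <- augment(fit$finalModel)
nrow(aug)

# predict() on the original predictors returns one row per original observation
probs <- predict(fit, newdata = predictors, type = "prob")
nrow(probs)
```

The predicted probabilities can then be bound to the original data with cbind(mtcars, probs) if an augment-like table is needed.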