Stepwise sampling in a tibble
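I am trying to simulate some data by sampling in several steps.
The first step (creating x) works fine.
In the second step, I want to create a variable y by sampling from a different vector depending on the value of x.
My code runs without error, but it does not do what I want: it samples a single value for, e.g., x == "A" and then reuses that value for every row where x == "A". I want it to sample once for each row where x == "A".
Code: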
library(tidyverse)
set.seed(1)
data <- tibble(
  x = sample(c("A", "B", "C"), size = 10000, prob = c(0.1, 0.2, 0.7), replace = TRUE),
  y = case_when(
    x == "A" ~ sample(c("A1", "A2", "A3"), size = 1, prob = c(0.3, 0.4, 0.3)),
    x == "B" ~ sample(c("B1", "B2", "B3"), size = 1, prob = c(0.3, 0.4, 0.3)),
    x == "C" ~ sample(c("C1", "C2", "C3"), size = 1, prob = c(0.3, 0.4, 0.3)),
  ))
unique(data$x)
[1] "C" "A" "B"
unique(data$y)
[1] "C1" "A2" "B3"
If the code worked as intended, unique(data$y) should return something like [1] "A1", "A2", "A3", "B1", "B2", "B3", "C1", "C2", "C3"
I know the problem is the size = 1 argument in sample(), but what can I replace it with? Removing it returns an error:
Error: `x == "A" ~ sample(c("A1", "A2", "A3"), prob = c(0.3, 0.4, 0.3))` must be length 100 or one, not 3
I have also tried size = nrow(.data) and size = nrow(.), but these return errors as well.
Is there a simple way to fix this?
Maybe there is a simpler way, but this stays close to your original code and works...
data <- tibble(
  x = sample(c("A", "B", "C"), size = 10000, prob = c(0.1, 0.2, 0.7), replace = TRUE)) %>%
  rowwise() %>%  # evaluate the expressions below once per row
  summarise(x = x,
            y = case_when(
              x == "A" ~ sample(c("A1", "A2", "A3"), size = 1, prob = c(0.3, 0.4, 0.3)),
              x == "B" ~ sample(c("B1", "B2", "B3"), size = 1, prob = c(0.3, 0.4, 0.3)),
              x == "C" ~ sample(c("C1", "C2", "C3"), size = 1, prob = c(0.3, 0.4, 0.3)),
            ))
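This has to do with vectorized functions and recycling. In the vectorized version, sample() is called only once and its single value is recycled across all matching elements. If you do it with a loop, sample() is called once per element, so it works. For example,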
v1 <- c('A', 'A', 'B', 'B', 'C', 'C', 'C', 'A', 'A')
#Vectorized ifelse
ifelse(v1 == 'A', sample(c("A1", "A2", "A3"), size = 1, prob = c(0.3, 0.4, 0.3)), NA)
#[1] "A3" "A3" NA NA NA NA NA "A3" "A3"
#Not vectorized if/else with a loop,
sapply(v1, function(i) if (i == 'A') { sample(c("A1", "A2", "A3"), size = 1, prob = c(0.3, 0.4, 0.3)) }else {NA})
# A A B B C C C A A
#"A2" "A3" NA NA NA NA NA "A2" "A1"
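To make the recycling visible, here is a minimal base R sketch that counts how often the sampling code actually runs in each variant. The helper draw_a() and the counter calls are hypothetical names introduced only for this illustration.

```r
set.seed(1)
v1 <- c("A", "A", "B", "B", "C", "C", "C", "A", "A")

# Hypothetical helper: draws one value and counts how often it is called.
calls <- 0
draw_a <- function() {
  calls <<- calls + 1
  sample(c("A1", "A2", "A3"), size = 1, prob = c(0.3, 0.4, 0.3))
}

# Vectorized: draw_a() is evaluated once and its single value is recycled.
res_vectorized <- ifelse(v1 == "A", draw_a(), NA)
calls_vectorized <- calls

# Element-wise: draw_a() is evaluated once for every "A" element.
calls <- 0
res_loop <- vapply(
  v1,
  function(i) if (i == "A") draw_a() else NA_character_,
  character(1),
  USE.NAMES = FALSE
)
calls_loop <- calls

calls_vectorized  # 1
calls_loop        # 4, one per "A" in v1
```

The vectorized call can therefore only ever produce one distinct non-NA value, no matter how long v1 is.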
It is easier to understand if you break it into steps:
library(dplyr)
data <- tibble(
  x = sample(c("A", "B", "C"), size = 10000,
             prob = c(0.1, 0.2, 0.7), replace = TRUE))
data <- data %>%
  mutate(y = case_when(
    x == "A" ~ sample(c("A1", "A2", "A3"), size = n(),
                      prob = c(0.3, 0.4, 0.3), replace = TRUE),
    x == "B" ~ sample(c("B1", "B2", "B3"), size = n(),
                      prob = c(0.3, 0.4, 0.3), replace = TRUE),
    x == "C" ~ sample(c("C1", "C2", "C3"), size = n(),
                      prob = c(0.3, 0.4, 0.3), replace = TRUE),
  ))
unique(data$y)
#[1] "C2" "B3" "A1" "C3" "B1" "C1" "B2" "A3" "A2"
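With size = n(), each branch draws a full-length vector and case_when() picks the matching element for each row, so three times as many values are drawn as end up being used. A possible alternative, sketched here in base R, is to draw only as many values as each group needs. The lookup list choices and the loop are assumptions made for this illustration, not something taken from the answers in this thread.

```r
set.seed(1)
n <- 10000
x <- sample(c("A", "B", "C"), size = n, prob = c(0.1, 0.2, 0.7), replace = TRUE)

# Hypothetical lookup: the candidate values for each level of x.
choices <- list(
  A = c("A1", "A2", "A3"),
  B = c("B1", "B2", "B3"),
  C = c("C1", "C2", "C3")
)
probs <- c(0.3, 0.4, 0.3)

# For each level, sample exactly as many values as there are matching rows.
y <- character(n)
for (g in names(choices)) {
  idx <- which(x == g)
  y[idx] <- sample(choices[[g]], size = length(idx), prob = probs, replace = TRUE)
}

sort(unique(y))
```

Every row still gets its own independent draw, and the first character of y always matches x.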
Or, if you want to stick with your approach, you need to set the size argument to the same value used for x, and add replace = TRUE:
data <- tibble(
  x = sample(c("A", "B", "C"), size = 10000,
             prob = c(0.1, 0.2, 0.7), replace = TRUE),
  y = case_when(
    x == "A" ~ sample(c("A1", "A2", "A3"), size = 10000,
                      prob = c(0.3, 0.4, 0.3), replace = TRUE),
    x == "B" ~ sample(c("B1", "B2", "B3"), size = 10000,
                      prob = c(0.3, 0.4, 0.3), replace = TRUE),
    x == "C" ~ sample(c("C1", "C2", "C3"), size = 10000,
                      prob = c(0.3, 0.4, 0.3), replace = TRUE),
  ))
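One way to check that the full-length approach gives each row its own draw is to tabulate y within each level of x. The sketch below is base R only and uses nested ifelse() as a stand-in for case_when(); that substitution, and the names y_a, y_b, y_c and tab, are assumptions made to keep the example free of package dependencies.

```r
set.seed(1)
n <- 10000
x <- sample(c("A", "B", "C"), size = n, prob = c(0.1, 0.2, 0.7), replace = TRUE)

# One full-length draw per branch, as with size = 10000 above.
y_a <- sample(c("A1", "A2", "A3"), size = n, prob = c(0.3, 0.4, 0.3), replace = TRUE)
y_b <- sample(c("B1", "B2", "B3"), size = n, prob = c(0.3, 0.4, 0.3), replace = TRUE)
y_c <- sample(c("C1", "C2", "C3"), size = n, prob = c(0.3, 0.4, 0.3), replace = TRUE)

# Pick, row by row, the draw that belongs to the row's level of x.
y <- ifelse(x == "A", y_a, ifelse(x == "B", y_b, y_c))

# Row-wise proportions of y within each level of x.
tab <- prop.table(table(x, y), margin = 1)
round(tab, 2)
```

Within each row of the table, the three matching columns should come out near 0.3, 0.4 and 0.3, and all other cells should be zero.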