Random sample from multiple columns

Question

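I have a dataset with multiple columns, where each row represents a product and each column contains one comment on that product. For each product we observe several comments, each stored in its own column.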

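I now want to create two new datasets: (1) a dataset with a single column containing a random sample of x comments drawn from across the comment columns; (2) the same as (1), but drawing the same number of comments from each column (e.g., 2 comments from "comment1" and 2 comments from "comment2").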

Example data:
commentda = data.frame(
  product_id = c(1, 2, 3, 4),
  comment1 = c("Very good", "Bad", "Would buy it", "Zero stars"),
  comment2 = c("Bad reputation", "Good seller", "Great service", "I will buy it again")
)

> commentda
  product_id     comment1            comment2
1          1    Very good      Bad reputation
2          2          Bad         Good seller
3          3 Would buy it       Great service
4          4   Zero stars I will buy it again

Answer 1

You can reshape the data into long format, which makes this kind of operation straightforward.

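library(dplyr)

n <- 2  # number of comments to sample

# Reshape to long format: one row per product and comment column.
# By default the column names go to `name` and the comment text to `value`.
long_data <- commentda %>%
  tidyr::pivot_longer(cols = starts_with('comment'))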

To sample n random comments across all comment columns:

long_data %>% slice_sample(n = n)
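
The question asks for a dataset with only one column, whereas the result above still carries product_id and name. A minimal sketch of that last step, assuming dplyr and tidyr are installed; the names n_comments, sample_any, and comment are illustrative and not from the original answer, and set.seed() is added only to make the draw reproducible:

```r
library(dplyr)
library(tidyr)

commentda <- data.frame(
  product_id = c(1, 2, 3, 4),
  comment1 = c("Very good", "Bad", "Would buy it", "Zero stars"),
  comment2 = c("Bad reputation", "Good seller", "Great service", "I will buy it again"),
  stringsAsFactors = FALSE
)

n_comments <- 2  # hypothetical name for the sample size
set.seed(42)     # assumption: fixed seed so the sample can be reproduced

sample_any <- commentda %>%
  pivot_longer(cols = starts_with("comment")) %>%  # long format: name / value
  slice_sample(n = n_comments) %>%                 # draw from all comments pooled
  select(comment = value)                          # keep only the comment text
```

Here sample_any is a one-column tibble with n_comments rows; both comments may happen to come from the same original column.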

To sample n random comments from each column:

long_data %>%
  group_by(name) %>%
  slice_sample(n = n)
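
For the second dataset, the grouped result can likewise be reduced to a single column. The sketch below is one possible way to do this, under the same assumptions as the rest of this answer (dplyr and tidyr available); per_column, sample_each, n_per_column, and comment are illustrative names. The grouping is dropped with ungroup() before select(), because select() on a grouped tibble keeps the grouping column.

```r
library(dplyr)
library(tidyr)

commentda <- data.frame(
  product_id = c(1, 2, 3, 4),
  comment1 = c("Very good", "Bad", "Would buy it", "Zero stars"),
  comment2 = c("Bad reputation", "Good seller", "Great service", "I will buy it again"),
  stringsAsFactors = FALSE
)

n_per_column <- 2  # hypothetical name: comments to draw from each column
set.seed(42)       # assumption: fixed seed so the sample can be reproduced

# Sample within each original comment column (stored in `name`).
per_column <- commentda %>%
  pivot_longer(cols = starts_with("comment")) %>%
  group_by(name) %>%
  slice_sample(n = n_per_column) %>%
  ungroup()

# Check that every column contributed the same number of comments.
count(per_column, name)

# Keep only the comment text as a single column.
sample_each <- per_column %>%
  select(comment = value)
```

Note that when sampling without replacement, slice_sample() silently truncates n to the group size, so a column with fewer than n_per_column comments would contribute fewer rows.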
