How to randomly sample entire groups based on multiple grouping conditions

I have a data frame like this:

library(tibble)

df <- tribble(
  ~uniquename, ~frame, ~id,       ~datetime,      
  "unique1",1, "b1", "2021-05-05 07:05:01", 
  "unique1",1, "b5" , "2021-05-05 07:05:01", 
  "unique1",2, "b1", "2021-05-05 07:05:03", 
  "unique1",2, "b2", "2021-05-05 07:05:03", 
  "unique1",2, "b3" , "2021-05-05 07:05:03", 
  "unique1",3, "b2", "2021-05-05 07:07:03", 
  "unique1",3, "b4" , "2021-05-05 07:07:03", 
  "unique2",1, "b3", "2021-06-06 09:17:25",
  "unique2",1, "b4", "2021-06-06 09:17:25", 
  "unique2",12, "b5", "2021-06-06 09:20:17", 
  "unique2",12, "b6" , "2021-06-06 09:20:17",
  "unique2",16, "b1", "2021-06-06 09:20:59", 
  "unique2",16, "b2", "2021-06-06 09:20:59", 
  "unique2",16, "b3" , "2021-06-06 09:20:59", 
  "unique2",16, "b4", "2021-06-06 09:20:59")

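I am trying to extract a random set of rows (specifically, a random group from the 'frame' column) per minute (based on datetime) for each unique grouping variable (uniquename). To put it more clearly: within each 'uniquename', I want to extract one whole frame group every n minutes (1 minute in this example, but in theory it could be 5, 10, etc.).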

So, for the 1-minute example, the result would look like this:

result_df <- tribble(
  ~uniquename, ~frame, ~id,       ~datetime,      
  "unique1",2, "b1", "2021-05-05 07:05:03", 
  "unique1",2, "b2", "2021-05-05 07:05:03", 
  "unique1",2, "b3" , "2021-05-05 07:05:03", 
  "unique1",3, "b2", "2021-05-05 07:07:03", 
  "unique1",3, "b4" , "2021-05-05 07:07:03", 
  "unique2",1, "b3", "2021-06-06 09:17:25",
  "unique2",1, "b4", "2021-06-06 09:17:25", 
  "unique2",12, "b5", "2021-06-06 09:20:17", 
  "unique2",12, "b6" , "2021-06-06 09:20:17")

As you can see, within 'unique1' only frames 2 and 3 are kept, because frames 1 and 2 are less than 1 minute apart and I want to randomly select just one of them to keep.

When I tried this, I created a new column holding a timestamp for each minute and tried to slice like this:

df <- df %>% group_by(uniquename) %>% mutate(mincut = cut(datetime, "1 min")) %>% group_by(uniquename,mincut) %>% slice_sample()

But this only slices 1 row per group, not the entire frame group.

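In data.table (which would actually be preferable, since my data frame has ~1,000,000 rows), I tried the following (with the mincut column from the dplyr code included), but it also extracts only 1 row per group rather than the entire frame group.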

df[df[, sample(.I, 1), by = c('uniquename', 'mincut')][[3]], ]

Is there a way to modify this code, or some other approach, that would let me extract an entire frame group for each minute within uniquename?

Many thanks.

You can use lubridate::floor_date to create the time groups, then filter each group down to one randomly sampled frame. You can set whatever interval you need in floor_date; here it is "1 minute".

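library(dplyr)
library(lubridate)

df %>% 
  mutate(datetime = ymd_hms(datetime),
         fl = floor_date(datetime, "1 minute")) %>% 
  group_by(uniquename, fl) %>% 
  # Index into unique(frame) instead of calling sample(unique(frame), 1):
  # sample(x, 1) draws from 1:x when x is a single number >= 1
  filter(frame == unique(frame)[sample.int(n_distinct(frame), 1)])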

Output (one possible result, since the sampling is random):

# A tibble: 11 × 5
# Groups:   uniquename, fl [4]
   uniquename frame id    datetime            fl              
   <chr>      <dbl> <chr> <dttm>              <dttm>             
 1 unique1        2 b1    2021-05-05 07:05:03 2021-05-05 07:05:00
 2 unique1        2 b2    2021-05-05 07:05:03 2021-05-05 07:05:00
 3 unique1        2 b3    2021-05-05 07:05:03 2021-05-05 07:05:00
 4 unique1        3 b2    2021-05-05 07:07:03 2021-05-05 07:07:00
 5 unique1        3 b4    2021-05-05 07:07:03 2021-05-05 07:07:00
 6 unique2        1 b3    2021-06-06 09:17:25 2021-06-06 09:17:00
 7 unique2        1 b4    2021-06-06 09:17:25 2021-06-06 09:17:00
 8 unique2       16 b1    2021-06-06 09:20:59 2021-06-06 09:20:00
 9 unique2       16 b2    2021-06-06 09:20:59 2021-06-06 09:20:00
10 unique2       16 b3    2021-06-06 09:20:59 2021-06-06 09:20:00
11 unique2       16 b4    2021-06-06 09:20:59 2021-06-06 09:20:00
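
Since the question mentions that data.table would be preferable for ~1,000,000 rows, here is a rough sketch of the same idea in data.table. The helper name sample_frames_dt, its minutes argument, and the bin column are my own illustrative choices, not something from the question. It also assumes the timestamps can be parsed as UTC, and it bins by counting whole intervals from the Unix epoch rather than calling floor_date.

```r
library(data.table)

# Hypothetical helper: keep one randomly chosen frame (with all of its rows)
# per uniquename and per time bin of `minutes` minutes.
sample_frames_dt <- function(df, minutes = 1) {
  dt <- as.data.table(df)  # as.data.table() returns a copy of the input

  # Seconds since the epoch; tz = "UTC" is an assumption about the data.
  secs <- as.numeric(as.POSIXct(dt$datetime, tz = "UTC"))

  # Integer-like bin id: number of whole intervals since the epoch.
  set(dt, j = "bin", value = secs %/% (60 * minutes))

  # Pick one frame per (uniquename, bin). Indexing with sample.int() avoids
  # the sample(x, 1) pitfall when a bin contains a single numeric frame.
  picked <- dt[, {
    u <- unique(frame)
    .(frame = u[sample.int(length(u), 1L)])
  }, by = .(uniquename, bin)]

  # Join back so every row of each picked frame is returned.
  dt[picked, on = .(uniquename, bin, frame)]
}
```

Calling sample_frames_dt(df) on the example data should return whole frame groups, one per uniquename and minute, and sample_frames_dt(df, minutes = 5) widens the bins to 5 minutes. Which frame is picked in bins holding several frames varies from run to run unless you call set.seed() first.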