Sample n rows from a data frame by group using another data frame

I want to randomly sample n rows from one data frame by group, where n for each group is given by another data frame.

Example

Randomly sample rows from the ggplot2::mpg data frame, grouped by manufacturer and year, where n is the pick column of the pick_df data frame.

i.e. randomly sample 3 rows from ggplot2::mpg that are hondas made in 2008, 10 volkswagens made in 1999, 6 audis made in 1999, etc.

  manufacturer  year  pick
  <chr>        <int> <int>
1 honda         2008     3
2 volkswagen    1999    10
3 audi          1999     6
4 land rover    2008     2
5 subaru        1999     6

Expected output:

  manufacturer model      displ  year   cyl trans      drv     cty   hwy fl    class     
   <chr>        <chr>      <dbl> <int> <int> <chr>      <chr> <int> <int> <chr> <chr>     
 1 honda        civic        1.8  2008     4 manual(m5) f        26    34 r     subcompact
 2 honda        civic        1.8  2008     4 auto(l5)   f        25    36 r     subcompact
 3 honda        civic        1.8  2008     4 auto(l5)   f        24    36 c     subcompact
 4 volkswagen   gti          2.8  1999     6 manual(m5) f        17    24 r     compact   
 5 volkswagen   passat       2.8  1999     6 manual(m5) f        18    26 p     midsize   
 6 volkswagen   new beetle   1.9  1999     4 auto(l4)   f        29    41 d     subcompact
 7 volkswagen   new beetle   2    1999     4 auto(l4)   f        19    26 r     subcompact
 8 volkswagen   jetta        1.9  1999     4 manual(m5) f        33    44 d     compact   
 9 volkswagen   passat       2.8  1999     6 auto(l5)   f        16    26 p     midsize   
10 volkswagen   jetta        2.8  1999     6 auto(l4)   f        16    23 r     compact   
11 volkswagen   new beetle   2    1999     4 manual(m5) f        21    29 r     subcompact
12 volkswagen   passat       1.8  1999     4 manual(m5) f        21    29 p     midsize   
13 volkswagen   gti          2    1999     4 auto(l4)   f        19    26 r     compact  

...27 rows total...

Head of the mpg data frame to sample from:

   manufacturer model      displ  year   cyl trans      drv     cty   hwy fl    class  
   <chr>        <chr>      <dbl> <int> <int> <chr>      <chr> <int> <int> <chr> <chr>  
 1 audi         a4           1.8  1999     4 auto(l5)   f        18    29 p     compact
 2 audi         a4           1.8  1999     4 manual(m5) f        21    29 p     compact
 3 audi         a4           2    2008     4 manual(m6) f        20    31 p     compact
 4 audi         a4           2    2008     4 auto(av)   f        21    30 p     compact
 5 audi         a4           2.8  1999     6 auto(l5)   f        16    26 p     compact
 6 audi         a4           2.8  1999     6 manual(m5) f        18    26 p     compact
 7 audi         a4           3.1  2008     6 auto(av)   f        18    27 p     compact
 8 audi         a4 quattro   1.8  1999     4 manual(m5) 4        18    26 p     compact
 9 audi         a4 quattro   1.8  1999     4 auto(l5)   4        16    25 p     compact
10 audi         a4 quattro   2    2008     4 manual(m6) 4        20    28 p     compact

Data sources for the reprex:

Source for the pick data frame, pick_df:

pick_df <- structure(list(manufacturer = c("honda", "volkswagen", "audi", 
"land rover", "subaru"), year = c(2008L, 1999L, 1999L, 2008L, 
1999L), pick = c(3L, 10L, 6L, 2L, 6L)), class = c("tbl_df", "tbl", 
"data.frame"), row.names = c(NA, -5L))

mpg, the data frame to sample from: ggplot2::mpg
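
Before sampling, it can help to confirm that every group in pick_df has at least pick rows available in mpg, since sample_n() without replacement errors when asked for more rows than a group holds. A minimal sketch, assuming dplyr and ggplot2 are installed; the names available, check and enough are introduced here for illustration only:

```r
library(dplyr)
library(ggplot2)  # provides the mpg data set

# pick_df rebuilt from the dput() output above
pick_df <- tibble(
  manufacturer = c("honda", "volkswagen", "audi", "land rover", "subaru"),
  year = c(2008L, 1999L, 1999L, 2008L, 1999L),
  pick = c(3L, 10L, 6L, 2L, 6L)
)

# Count the rows available in mpg for each manufacturer/year group
available <- count(mpg, manufacturer, year, name = "available")

# Attach the available count to each requested group and flag shortfalls
check <- pick_df %>%
  left_join(available, by = c("manufacturer", "year")) %>%
  mutate(enough = !is.na(available) & available >= pick)

check
```

Any row with enough == FALSE would need a smaller pick (or sampling with replacement) before the sampling step can succeed.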

What I have tried so far

I can use filter or possibly slice, but the coding is all manual. The real use case has thousands of rows and hundreds of groups.

filter(mpg, manufacturer=='honda', year==2008) %>% sample_n(3)
filter(mpg, manufacturer=='volkswagen', year==1999) %>% sample_n(10)
etc...

Edit: I can loop over filter, but it's a bit ugly:

df <- mpg[0,]
for(i in 1:nrow(pick_df)){
  temp <- filter(mpg, manufacturer==pick_df$manufacturer[i], year==pick_df$year[i]) %>% sample_n(pick_df$pick[i])
  df <- rbind(temp,df)
}
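
The same loop can be written without growing df one rbind() at a time. A minimal sketch, assuming dplyr and ggplot2 are installed; sample_group and sampled are illustrative names, not part of the original code:

```r
library(dplyr)
library(ggplot2)  # provides the mpg data set

# pick_df rebuilt from the dput() output above
pick_df <- tibble(
  manufacturer = c("honda", "volkswagen", "audi", "land rover", "subaru"),
  year = c(2008L, 1999L, 1999L, 2008L, 1999L),
  pick = c(3L, 10L, 6L, 2L, 6L)
)

# Sample pick_df$pick[i] rows from the mpg group described by row i of pick_df
sample_group <- function(i) {
  mpg %>%
    filter(manufacturer == pick_df$manufacturer[i], year == pick_df$year[i]) %>%
    sample_n(pick_df$pick[i])
}

# One sampled tibble per row of pick_df, stacked into a single data frame
sampled <- bind_rows(lapply(seq_len(nrow(pick_df)), sample_group))

# The total should equal the sum of the requested counts
nrow(sampled) == sum(pick_df$pick)
```

This still filters mpg once per row of pick_df, so it keeps the loop's logic and only tidies how the pieces are combined.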

We can do an inner_join with 'pick_df', group by 'manufacturer' and 'year', and call sample_n with the first value of 'pick' in each group.

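library(dplyr)
library(ggplot2)  # provides mpg
mpg %>%
    # keep only the groups listed in pick_df and attach their pick value
    inner_join(pick_df, by = c("manufacturer", "year")) %>%
    group_by(manufacturer, year) %>%
    # pick is constant within a group, so its first value is the sample size
    sample_n(first(pick))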
# A tibble: 27 x 12
# Groups:   manufacturer, year [5]
#   manufacturer model       displ  year   cyl trans      drv     cty   hwy fl    class       pick
#   <chr>        <chr>       <dbl> <int> <int> <chr>      <chr> <int> <int> <chr> <chr>      <int>
# 1 audi         a4 quattro    1.8  1999     4 auto(l5)   4        16    25 p     compact        6
# 2 audi         a6 quattro    2.8  1999     6 auto(l5)   4        15    24 p     midsize        6
# 3 audi         a4            2.8  1999     6 auto(l5)   f        16    26 p     compact        6
# 4 audi         a4 quattro    2.8  1999     6 auto(l5)   4        15    25 p     compact        6
# 5 audi         a4            1.8  1999     4 auto(l5)   f        18    29 p     compact        6
# 6 audi         a4            2.8  1999     6 manual(m5) f        18    26 p     compact        6
# 7 honda        civic         1.8  2008     4 manual(m5) f        26    34 r     subcompact     3
# 8 honda        civic         2    2008     4 manual(m6) f        21    29 p     subcompact     3
# 9 honda        civic         1.8  2008     4 auto(l5)   f        24    36 c     subcompact     3
#10 land rover   range rover   4.2  2008     8 auto(s6)   4        12    18 r     suv            2
# … with 17 more rows
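
For a repeatable draw, a seed can be set before sampling, and the helper pick column can be dropped once it has served its purpose. The sketch below also checks that each group received the requested number of rows. It assumes dplyr and ggplot2 are installed; sampled, counts and the seed value are illustrative choices, not part of the answer above:

```r
library(dplyr)
library(ggplot2)  # provides the mpg data set

# pick_df rebuilt from the dput() output in the question
pick_df <- tibble(
  manufacturer = c("honda", "volkswagen", "audi", "land rover", "subaru"),
  year = c(2008L, 1999L, 1999L, 2008L, 1999L),
  pick = c(3L, 10L, 6L, 2L, 6L)
)

set.seed(123)  # arbitrary seed, only to make the draw repeatable

sampled <- mpg %>%
  inner_join(pick_df, by = c("manufacturer", "year")) %>%
  group_by(manufacturer, year) %>%
  sample_n(first(pick)) %>%
  ungroup() %>%      # drop the grouping so later steps see a plain tibble
  select(-pick)      # remove the helper column added by the join

# Compare the rows drawn per group against the requested counts
counts <- sampled %>%
  count(manufacturer, year, name = "drawn") %>%
  inner_join(pick_df, by = c("manufacturer", "year"))

all(counts$drawn == counts$pick)
```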