How to group data into not pre-determined intervals using R and dplyr?

I have data like this (obviously simplified):

Var1 Var2 Var3
20   0.4  a
50   0.5  a
80   0.6  b
150  0.3  a
250  0.4  b

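I want to group rows by Var1 when their values fall within an interval of 50 of each other, then take the mean of Var1 and Var2, keep Var3 as it is if the group is homogeneous, and rename it if the group has mixed labels. In this case, I would get: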

Var1 Var2 Var3
50   0.5  mixed
150  0.3  a
250  0.4  b

I think I should use the group_by function from the dplyr package, but I don't know exactly how. Thanks for your help!

Here is the dput of the data frame:

d <- structure(list(Var1 = c(20L, 50L, 80L, 150L, 250L), Var2 = c(0.4, 
0.5, 0.6, 0.3, 0.4), Var3 = structure(c(1L, 1L, 2L, 1L, 2L), .Label = c("a", 
"b"), class = "factor")), class = "data.frame", row.names = c(NA, 
-5L))

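I would

  1. create some temporary columns to determine when a new group starts
  2. group and compute the means, while keeping track of the distinct values of Var3
  3. change Var3 to mixed if a group has more than one Var3 value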

In the tidyverse this might look like

library(dplyr)

d %>% 
 # make sure we sort Var1
 arrange(Var1) %>% 
 # add 50 to Var1 and compare the result with the next row's Var1
 # if the next value exceeds the current one by more than 50, we mark it as a new group
 mutate(nextint=Var1+50, 
       newgroup=Var1>lag(nextint,default=-Inf), 
       grp=cumsum(newgroup)) %>%
 # for each group, get the mean and a comma separated list of distinct Var3 values
 group_by(grp) %>% 
 summarise(
           grplbl=floor(max(Var1)/50)*50,
           mu=mean(Var2), 
           mix=paste(collapse=",",unique(Var3))) %>%
 # if mix (distinct Var3) has a comma in it, change from e.g. 'a,b' to 'mixed'
 mutate(mix=ifelse(grepl(',', mix), 'mixed', mix))
# A tibble: 3 x 4
    grp grplbl    mu mix  
  <int>  <dbl> <dbl> <chr>
1     1     50   0.5 mixed
2     2    150   0.3 a    
3     3    250   0.4 b  
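
To see why the rows with Var1 of 20, 50 and 80 end up together, it can help to look at the helper columns before summarising. This is a minimal sketch, assuming `d` holds the data from the question (rebuilt here with `data.frame` so the block runs on its own; `steps` is just an illustrative name):

```r
library(dplyr)

# Same values as the dput in the question, rebuilt for a self-contained example
d <- data.frame(
  Var1 = c(20L, 50L, 80L, 150L, 250L),
  Var2 = c(0.4, 0.5, 0.6, 0.3, 0.4),
  Var3 = factor(c("a", "a", "b", "a", "b"))
)

steps <- d %>%
  arrange(Var1) %>%
  mutate(nextint  = Var1 + 50,                            # largest Var1 the next row may have to stay in the group
         newgroup = Var1 > lag(nextint, default = -Inf),  # TRUE when the gap to the previous row exceeds 50
         grp      = cumsum(newgroup))                     # running count of group starts = group id

steps$grp
# [1] 1 1 1 2 3
```

Note that the comparison is always between consecutive rows, so a group can span more than 50 overall: here 20 and 80 share a group because each step (20 to 50, 50 to 80) is within 50.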

Another dplyr possibility could be:

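d %>%
 # a new group starts whenever Var1 jumps by more than 50 from the previous row
 group_by(grp = cumsum(Var1 - lag(Var1, default = first(Var1)) > 50)) %>%
 summarise(Var1 = mean(Var1),
           Var2 = mean(Var2),
           # as.character() because Var3 is a factor in the dput above;
           # ifelse() would otherwise return its integer codes
           Var3 = ifelse(n_distinct(Var3) > 1, "mixed", as.character(first(Var3)))) %>%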
 ungroup() %>%
 select(-grp)

   Var1  Var2 Var3 
  <dbl> <dbl> <chr>
1    50   0.5 mixed
2   150   0.3 a    
3   250   0.4 b
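
If the interval width needs to vary, the same idea could be wrapped in a small function. This is only a sketch: `summarise_by_gap` and its `gap` argument are names made up for illustration, and it assumes the columns are called Var1, Var2 and Var3 as in the question.

```r
library(dplyr)

# Hypothetical helper (not part of dplyr): starts a new group whenever the
# step from the previous Var1 value is larger than `gap`
summarise_by_gap <- function(data, gap = 50) {
  data %>%
    arrange(Var1) %>%
    group_by(grp = cumsum(Var1 - lag(Var1, default = first(Var1)) > gap)) %>%
    summarise(Var1 = mean(Var1),
              Var2 = mean(Var2),
              # as.character() keeps factor labels rather than integer codes
              Var3 = if (n_distinct(Var3) > 1) "mixed" else as.character(first(Var3))) %>%
    select(-grp)
}

# Same values as the dput in the question
d <- data.frame(
  Var1 = c(20L, 50L, 80L, 150L, 250L),
  Var2 = c(0.4, 0.5, 0.6, 0.3, 0.4),
  Var3 = factor(c("a", "a", "b", "a", "b"))
)

res <- summarise_by_gap(d, gap = 50)
```

With `gap = 50` this should give the same three rows as above; a different `gap` changes where the groups split.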