How to group data into not pre-determined intervals using R and dplyr?
I have data like this (greatly simplified):
Var1 Var2 Var3
20 0.4 a
50 0.5 a
80 0.6 b
150 0.3 a
250 0.4 b
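I want to group rows by Var1 when their values fall within an interval of 50 of each other, then take the mean of Var1 and Var2 per group. Var3 should stay as is when the group is homogeneous and be renamed to "mixed" when the group mixes labels. In this case I would get: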
Var1 Var2 Var3
50 0.5 mixed
150 0.3 a
250 0.4 b
I think I should use the group_by function from the dplyr package, but I don't know exactly how. Thanks for your help!
Here is the dput of the data frame:
d <- structure(list(Var1 = c(20L, 50L, 80L, 150L, 250L), Var2 = c(0.4,
0.5, 0.6, 0.3, 0.4), Var3 = structure(c(1L, 1L, 2L, 1L, 2L), .Label = c("a",
"b"), class = "factor")), class = "data.frame", row.names = c(NA,
-5L))
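I would
- create some temporary columns to determine when a new group starts
- group and compute the means, while keeping track of the distinct values of Var3
- change Var3 to "mixed" if a group contains more than one Var3 value
In the tidyverse this might look like: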
library(dplyr)

d %>%
  # make sure the rows are sorted by Var1
  arrange(Var1) %>%
  # add 50 to Var1 and test the result against the next row:
  # if the next value exceeds the current one by more than 50, it starts a new group
  mutate(nextint = Var1 + 50,
         newgroup = Var1 > lag(nextint, default = -Inf),
         grp = cumsum(newgroup)) %>%
  # for each group, get the mean and a comma-separated list of distinct Var3 values
  group_by(grp) %>%
  summarise(
    grplbl = floor(max(Var1) / 50) * 50,
    mu = mean(Var2),
    mix = paste(collapse = ",", unique(Var3))) %>%
  # if mix (distinct Var3) has a comma in it, change it from e.g. 'a,b' to 'mixed'
  mutate(mix = ifelse(grepl(',', mix), 'mixed', mix))
# A tibble: 3 x 4
    grp grplbl    mu mix
  <int>  <dbl> <dbl> <chr>
1     1     50   0.5 mixed
2     2    150   0.3 a
3     3    250   0.4 b
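To see how the temporary columns decide where a group starts, it can help to stop the pipeline before the summarise step. The following is only a sketch: it rebuilds the question's data with data.frame() so it runs on its own, and the name steps is introduced here for illustration.

```r
library(dplyr)

# Same values as the dput in the question, rebuilt so the snippet is self-contained
d <- data.frame(
  Var1 = c(20L, 50L, 80L, 150L, 250L),
  Var2 = c(0.4, 0.5, 0.6, 0.3, 0.4),
  Var3 = factor(c("a", "a", "b", "a", "b"))
)

# `steps` is a hypothetical name for the intermediate result
steps <- d %>%
  arrange(Var1) %>%
  mutate(
    nextint  = Var1 + 50,                            # how far each row "reaches"
    newgroup = Var1 > lag(nextint, default = -Inf),  # TRUE when the previous row does not reach this one
    grp      = cumsum(newgroup)                      # running count of group starts
  )

steps$newgroup
# TRUE FALSE FALSE TRUE TRUE
steps$grp
# 1 1 1 2 3
```

The first row always starts a group because of default = -Inf; 150 and 250 start new groups because each is more than 50 above the row before it.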
Another dplyr possibility could be:
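d %>%
  # start a new group whenever the jump from the previous Var1 exceeds 50
  # (assumes the rows are already sorted by Var1)
  group_by(grp = cumsum(Var1 - lag(Var1, default = first(Var1)) > 50)) %>%
  summarise(Var1 = mean(Var1),
            Var2 = mean(Var2),
            # Var3 is a factor in d, so convert it to character before mixing it with "mixed"
            Var3 = if (n_distinct(Var3) > 1) "mixed" else as.character(first(Var3))) %>%
  ungroup() %>%
  select(-grp)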
   Var1  Var2 Var3
  <dbl> <dbl> <chr>
1    50   0.5 mixed
2   150   0.3 a
3   250   0.4 b
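If the threshold of 50 may change, the same idea can be wrapped in a small function. This is a sketch under assumptions rather than part of either answer: the function name summarise_by_gap and its gap argument are hypothetical, the column names Var1, Var2 and Var3 are hard-coded, and the data is rebuilt with data.frame() so the snippet runs on its own.

```r
library(dplyr)

# Same values as the dput in the question
d <- data.frame(
  Var1 = c(20L, 50L, 80L, 150L, 250L),
  Var2 = c(0.4, 0.5, 0.6, 0.3, 0.4),
  Var3 = factor(c("a", "a", "b", "a", "b"))
)

# Hypothetical helper: `gap` is the largest distance between neighbouring
# Var1 values that still counts as belonging to the same group.
summarise_by_gap <- function(data, gap = 50) {
  data %>%
    arrange(Var1) %>%
    group_by(grp = cumsum(Var1 - lag(Var1, default = first(Var1)) > gap)) %>%
    summarise(
      Var1 = mean(Var1),
      Var2 = mean(Var2),
      # keep the single label, or flag the group when labels differ
      Var3 = if (n_distinct(Var3) > 1) "mixed" else as.character(first(Var3))
    ) %>%
    select(-grp)
}

res <- summarise_by_gap(d)             # three groups, as in the question
wide <- summarise_by_gap(d, gap = 100) # every jump is <= 100, so one group
```

Note that groups are formed by chaining neighbours, so a long run of closely spaced values ends up in one group even if its first and last values are far more than gap apart. Whether that is desirable depends on the data.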