R, dplyr: assign number of occurrences as value to columns at several group_by() levels
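library(plyr)   # loaded before dplyr so that dplyr's verbs are not masked
library(dplyr)
set.seed(8)     # sample() changed in R 3.6.0, so newer versions may draw different rows
df <- data.frame(
  group = sample(c("A", "B"), 10, replace = TRUE),
  subgroup = sample(c("a", "b", "c"), 10, replace = TRUE),
  value = runif(10, -1, 1)
)
df %>% arrange(group, subgroup)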
gives:
group subgroup value
1 A a -0.1841505
2 A a 0.3265360
3 A a -0.8045035
4 A b -0.5526222
5 B a 0.2238653
6 B a 0.0552373
7 B b 0.2297515
8 B b -0.5700525
9 B b 0.6347312
10 B c 0.9550054
I can flag whether each value is high or low, e.g.:
df2 <- df %>% mutate(reg = ifelse(value > 0, "high", "low"))
df2
gives:
group subgroup value reg
1 A b -0.5526222 low
2 A a -0.1841505 low
3 B b 0.2297515 high
4 B b -0.5700525 low
5 A a 0.3265360 high
6 B c 0.9550054 high
7 A a -0.8045035 low
8 B a 0.2238653 high
9 B a 0.0552373 high
10 B b 0.6347312 high
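Question:
I would like to obtain the columns low.group, high.group, low.subgroup and high.subgroup, indicating how many high and low values are found at the group level (I was thinking of dplyr's group_by(group) and n(), possibly with summarise()) and at the group + subgroup level (group_by(group, subgroup)). This would produce a data frame of 6 rows by 6 columns (the combinations of A/B and a/b/c, with the columns group, subgroup, low.group, high.group, low.subgroup and high.subgroup). The first row should be (A, a, 3, 1, 2, 1), the second (A, b, 3, 1, 1, 0), etc.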
I can count, for example with:
df %>%
  group_by(group, reg) %>%
  mutate(n.group = n())
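But how do I split n.group into two columns, low.group and high.group? The same question applies to the subgroups.
I'm sure the functions in plyr, dplyr and reshape2 can do this kind of combined counting and summarising, but how?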
Update:
Here is the result I would get by hand:
group subgroup low.group high.group low.subgroup high.subgroup
A a 3 1 2 1
A b 3 1 1 0
A c 3 1 0 0
B a 1 5 0 2
B b 1 5 1 2
B c 1 5 0 1
It's a bit verbose, but it seems to give the expected result:
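library(dplyr)
library(tidyr)
df %>%
  # recode value as "high" / "low"
  mutate(value = ifelse(value > 0, "high", "low")) %>%
  group_by(group, subgroup, value) %>%
  mutate(sub = n()) %>%   # count per group + subgroup + class
  group_by(group, value) %>%
  mutate(grp = n()) %>%   # count per group + class
  ungroup() %>%
  # current dplyr drops the other columns in distinct() unless .keep_all = TRUE
  distinct(group, subgroup, value, .keep_all = TRUE) %>%
  gather(key, val, sub:grp) %>%
  unite(x, value:key, sep = ".") %>%
  spread(x, val, fill = 0)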
#Source: local data frame [5 x 6]
#
# group subgroup high.grp high.sub low.grp low.sub
#1 A a 1 1 3 2
#2 A b 0 0 3 1
#3 B a 5 2 0 0
#4 B b 5 2 1 1
#5 B c 5 1 0 0
Note that the A-c combination does not occur in the sample data, so it does not appear in the output.
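One thing the output above also shows: because the wide table is built with fill = 0, a group-level column is 0 wherever the subgroup itself has no rows of that class (for example high.grp for A-b, although group A contains one high value). Below is a minimal sketch of an alternative, not the answerer's method, that counts with summarise() and adds the group totals afterwards, so they do not depend on the subgroup. It assumes dplyr >= 1.0 (for the .groups argument) and types in the ten rows printed in the question by hand, so the result does not depend on how sample() behaves in a given R version. complete() is used to add the A-c row with zero counts.

```r
library(dplyr)
library(tidyr)

# The ten rows shown in the question, entered by hand for reproducibility.
df <- data.frame(
  group    = c("A", "A", "A", "A", "B", "B", "B", "B", "B", "B"),
  subgroup = c("a", "a", "a", "b", "a", "a", "b", "b", "b", "c"),
  value    = c(-0.1841505, 0.3265360, -0.8045035, -0.5526222, 0.2238653,
               0.0552373, 0.2297515, -0.5700525, 0.6347312, 0.9550054)
)

res <- df %>%
  group_by(group, subgroup) %>%
  # Count low / high values within each group + subgroup cell,
  # using the same threshold as the question (high means value > 0).
  summarise(low.subgroup  = sum(value <= 0),
            high.subgroup = sum(value > 0),
            .groups = "drop") %>%
  # Add combinations that never occur (here A-c) with zero counts.
  complete(group, subgroup,
           fill = list(low.subgroup = 0L, high.subgroup = 0L)) %>%
  group_by(group) %>%
  # Group totals are the sums of the subgroup counts.
  mutate(low.group  = sum(low.subgroup),
         high.group = sum(high.subgroup)) %>%
  ungroup() %>%
  arrange(group, subgroup) %>%
  select(group, subgroup, low.group, high.group, low.subgroup, high.subgroup)

print(as.data.frame(res))
```

With the question's data this should give six rows, one per group/subgroup combination, with the columns named as in the question.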
A variant of docendo discimus's solution, using more reshape2 and less tidyr:
library(dplyr)
library(tidyr)
library(stringr)
library(reshape2)
df %>%
  mutate(value = ifelse(value > 0, "high", "low")) %>%
  group_by(group, subgroup, value) %>%
  mutate(sub = n()) %>%
  group_by(group, value) %>%
  mutate(grp = n()) %>%
  ungroup() %>%   # drop the grouping before reshaping
  gather(key, val, sub:grp) %>%
  mutate(val.key = str_c(value, ".", key)) %>%
  distinct() %>%
  # one row per group + subgroup, one column per class.level count
  dcast(group + subgroup ~ val.key, value.var = "val", fill = 0)
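In current tidyr, gather() and spread() are superseded by pivot_longer() and pivot_wider(). As a rough sketch of the reshaping step in that newer style (again with the question's ten rows typed in by hand, and with the column name reg taken from the question), count() produces one row per group/subgroup/class and pivot_wider() spreads the classes into columns. This only covers the subgroup-level counts; the names counts and sub are just illustrative.

```r
library(dplyr)
library(tidyr)

# The ten rows shown in the question, entered by hand for reproducibility.
df <- data.frame(
  group    = c("A", "A", "A", "A", "B", "B", "B", "B", "B", "B"),
  subgroup = c("a", "a", "a", "b", "a", "a", "b", "b", "b", "c"),
  value    = c(-0.1841505, 0.3265360, -0.8045035, -0.5526222, 0.2238653,
               0.0552373, 0.2297515, -0.5700525, 0.6347312, 0.9550054)
)

counts <- df %>%
  mutate(reg = ifelse(value > 0, "high", "low")) %>%
  # One row per observed group / subgroup / class, with its count in `sub`.
  count(group, subgroup, reg, name = "sub") %>%
  # Spread the classes into columns; missing cells become 0.
  pivot_wider(names_from = reg, values_from = sub,
              names_glue = "{reg}.sub", values_fill = 0L)

print(as.data.frame(counts))
```

As with the answers above, combinations that never occur in the data (A-c here) are not created by this step.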