Aggregating unique character values per group in R
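I want to aggregate all keywords per group for a given year.
I have a dataset like the following (a reproducible version is given in the answer below):
My main problem is that the Words column can hold anywhere from 1 to 52 values! I was thinking of splitting this column into separate columns and then using group_by, but I am not sure how to proceed.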
We can split 'Words' into a list of vectors, unnest it into 'long' format, drop the duplicate rows, group by 'Year' and 'UID', and paste the 'Words' back into a single string.
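library(dplyr)
library(tidyr)  # unnest() comes from tidyr, not dplyr
df1 %>%
  mutate(Words = strsplit(Words, ",")) %>%  # list column: one character vector per row
  unnest(Words) %>%                         # long format: one row per keyword
  distinct(Year, UID, Words) %>%            # drop keywords repeated within a Year/UID pair
  group_by(UID, Year) %>%
  summarise(Words = toString(Words))        # collapse into one ", "-separated string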
# A tibble: 4 x 3
# Groups: UID [?]
# UID Year Words
# <dbl> <dbl> <chr>
#1 10 2009 ABC, CDEFGH, LMX, ABCD, IJKLM, PQRS, EFGH
#2 11 2010 BDFC, CDE, PQRS, ACCA, IJKLM
#3 12 2010 ABCD, CADDE
#4 12 2011 ABC, CDE, EFGH
Data
df1 <- structure(list(ID = c(1, 2, 3, 4, 5, 6, 5), Year = c(2011, 2011,
2010, 2010, 2009, 2010, 2009), UID = c(12, 12, 11, 12, 10, 11,
10), Words = c("ABC,CDE", "EFGH,CDE", "BDFC,CDE,PQRS", "ABCD,CADDE",
"ABC,CDEFGH,LMX,ABCD,IJKLM,PQRS", "BDFC,ACCA,IJKLM", "EFGH")),
class = "data.frame", row.names = c(NA, -7L))
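As a possible variant of the pipeline above, tidyr's separate_rows() can do the split and the reshape to long format in a single call. This is only a sketch: it assumes dplyr and tidyr are installed, rebuilds df1 inline so it runs on its own, and uses `res` as a name chosen here for illustration.

```r
library(dplyr)
library(tidyr)

# Same values as df1 above, repeated so the snippet is self-contained
df1 <- data.frame(
  ID = c(1, 2, 3, 4, 5, 6, 5),
  Year = c(2011, 2011, 2010, 2010, 2009, 2010, 2009),
  UID = c(12, 12, 11, 12, 10, 11, 10),
  Words = c("ABC,CDE", "EFGH,CDE", "BDFC,CDE,PQRS", "ABCD,CADDE",
            "ABC,CDEFGH,LMX,ABCD,IJKLM,PQRS", "BDFC,ACCA,IJKLM", "EFGH"),
  stringsAsFactors = FALSE
)

res <- df1 %>%
  separate_rows(Words, sep = ",") %>%   # split on commas, one row per keyword
  distinct(Year, UID, Words) %>%        # keep each keyword once per Year/UID pair
  group_by(UID, Year) %>%
  summarise(Words = toString(Words), .groups = "drop")  # return an ungrouped result

res
```

The number of keywords per row does not matter here, since every keyword becomes its own row before grouping.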
A base R approach with aggregate:
df <- data.frame(
id = c(1:5, 6, 5),
year = c(2011, 2011, 2010, 2010, 2009, 2010, 2009),
uid = c(12, 12, 11, 12, 10, 11, 10),
words = c("abc,cde", "efgh,cde", "bdfc,cde,pqrs", "abcd,cadde", "abc,cdefgh,lmx,abcd,ijklm,pqrs","bdfc,acca,ijklm", "efgh"),
stringsAsFactors = FALSE
)
# For each year/uid pair: split every string on commas, flatten, de-duplicate, re-join
aggregate(df["words"], df[, c("year", "uid")],
          function(x) paste0(unique(unlist(strsplit(x, ","))), collapse = ","))
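To see what the anonymous function passed to aggregate does, it may help to run its steps on the strings of a single group. The sketch below uses the two rows with year 2010 and uid 11 from the data frame above; the intermediate names (`pieces`, `flat`, `uniq`, `out`) are introduced here only for illustration.

```r
# Keyword strings for one group (year 2010, uid 11), copied from df above
x <- c("bdfc,cde,pqrs", "bdfc,acca,ijklm")

pieces <- strsplit(x, ",")               # list of two character vectors
flat   <- unlist(pieces)                 # one vector: bdfc cde pqrs bdfc acca ijklm
uniq   <- unique(flat)                   # drops the second "bdfc", keeps first-seen order
out    <- paste0(uniq, collapse = ",")   # join back into a single string

out
# [1] "bdfc,cde,pqrs,acca,ijklm"
```

aggregate calls this function once per year/uid combination, so each group's strings are handled in the same way.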