How to collapse multiple rows with a condition into one row using dplyr in R?
I will illustrate my question with an example.
Sample data:
df <- data.frame(ID = 1:5, Description = c("'foo' is a dog", "'bar' is a dog", "'foo' is a cat", "'foo' is not a cat", "'bar' is a fish"), Category = c("A", "A", "B", "B", "C"))
> df
  ID        Description Category
1  1     'foo' is a dog        A
2  2     'bar' is a dog        A
3  3     'foo' is a cat        B
4  4 'foo' is not a cat        B
5  5    'bar' is a fish        C
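What I want to do is collapse rows that have a similar description within the same category, combining their IDs and quoted names. Expected output: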
   ID Category        Description
1   3        B     ‘foo’ is a cat
2 1,2        A ‘foo,bar’ is a dog
3   5        C    ‘bar’ is a fish
4   4        B ‘foo’ is not a cat
I would like to start with dplyr, but I can't quite work out how to achieve this. Can someone help me?
df %>%
  group_by(Category) %>%
  ## some condition to check whether the content outside the single quotes is the same.
  ## If so, collapse the rows into one; otherwise, leave them as they are.
  ## A regex to split off the quoted part, e.g.
  ## gsub("^'(.*?)'.*", "\\1", x)
  ## then collapse
  summarise(new_description = paste())
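The splitting step described in the comments above can be tried on its own before any grouping. This is a minimal base R sketch, assuming every description starts with a single-quoted token; the names `desc`, `quoted`, and `rest` are illustrative and not part of the original post.

```r
desc <- c("'foo' is a dog", "'bar' is a dog", "'foo' is not a cat")

# Text inside the leading single quotes (backreference must be written "\\1")
quoted <- sub("^'(.*?)'.*", "\\1", desc, perl = TRUE)

# Text after the closing quote, with surrounding whitespace trimmed
rest <- trimws(sub("^'.*?'(.*)", "\\1", desc, perl = TRUE))

print(quoted)
print(rest)
```

Rows whose `rest` is identical within a category are the ones that should collapse into a single row.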
I just figured it out; feel free to suggest a better solution:
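library(dplyr)
library(stringr)

df %>%
  # sec: the text after the quoted name; content: the quoted name itself
  mutate(sec = gsub("^'.*?'(.*)", "\\1", Description),
         content = gsub("^'(.*?)'.*", "\\1", Description)) %>%
  group_by(sec, Category) %>%
  # one row per (sec, Category), with IDs and names joined by commas
  summarise(
    ID = str_c(unique(ID), collapse = ","),
    content = str_c(unique(content), collapse = ",")) %>%
  # rebuild the description from the combined names and the shared text
  mutate(Description = str_c(sQuote(content), sec)) %>%
  ungroup() %>%
  dplyr::select(ID, Category, Description)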
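One detail worth checking in the code above: depending on the locale and the `useFancyQuotes` option, `sQuote()` may return directional quotes rather than plain apostrophes. A small sketch of how to request ASCII quotes explicitly, assuming R 3.6.0 or later (where the `q` argument is available); the variable name `plain` is illustrative.

```r
# q = FALSE asks sQuote() for plain ASCII apostrophes
plain <- sQuote("foo,bar", q = FALSE)
print(plain)
```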
Here is another way to get that output.
library(tidyverse)

df %>%
  # pull out the quoted name, then strip it from the description
  mutate(value = str_extract(Description, "'\\w+'"),
         Description = trimws(str_remove(Description, value))) %>%
  group_by(Description, Category) %>%
  # join IDs and names; re-wrap the joined names in a single pair of quotes
  summarise(ID = toString(ID),
            value = sprintf("'%s'", toString(gsub("'", "", value)))) %>%
  ungroup() %>%
  unite(Description, value, Description, sep = ' ')
#   Description         Category ID
#   <chr>               <chr>    <chr>
# 1 'foo' is a cat      B        3
# 2 'foo, bar' is a dog A        1, 2
# 3 'bar' is a fish     C        5
# 4 'foo' is not a cat  B        4
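For comparison, here is a sketch of the same idea that splits each description once with `str_match()` and returns the columns in the order the question asks for. It assumes every description begins with a single-quoted token and that dplyr 1.0.0 or later is installed (for `.groups`); the names `parts`, `value`, `rest`, and `out` are illustrative.

```r
library(dplyr)
library(stringr)

df <- data.frame(
  ID = 1:5,
  Description = c("'foo' is a dog", "'bar' is a dog", "'foo' is a cat",
                  "'foo' is not a cat", "'bar' is a fish"),
  Category = c("A", "A", "B", "B", "C"),
  stringsAsFactors = FALSE
)

# Column 2 holds the quoted name, column 3 the text that follows it
parts <- str_match(df$Description, "^'(.*?)'\\s*(.*)$")

out <- df %>%
  mutate(value = parts[, 2], rest = parts[, 3]) %>%
  group_by(Category, rest) %>%
  # collapse IDs and names for rows sharing the same category and text
  summarise(ID = paste(ID, collapse = ","),
            value = paste(unique(value), collapse = ","),
            .groups = "drop") %>%
  mutate(Description = paste0("'", value, "' ", rest)) %>%
  select(ID, Category, Description)

print(out)
```

Note that rows come back sorted by the grouping columns, so the row order differs from the expected output in the question even though the contents match.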