tidyverse: filter with str_detect
I would like to use the filter command from dplyr together with str_detect.
library(tidyverse)
dt1 <-
  tibble(
    No = c(1, 2, 3, 4)
    , Text = c("I have a pen.", "I have a book.", "I have a pencile.", "I have a pen and a book.")
  )
dt1
# A tibble: 4 x 2
     No Text
  <dbl> <chr>
1     1 I have a pen.
2     2 I have a book.
3     3 I have a pencile.
4     4 I have a pen and a book.
MatchText <- c("Pen", "Book")
dt1 %>%
  filter(str_detect(Text, regex(paste0(MatchText, collapse = '|'), ignore_case = TRUE)))
# A tibble: 4 x 2
     No Text
  <dbl> <chr>
1     1 I have a pen.
2     2 I have a book.
3     3 I have a pencile.
4     4 I have a pen and a book.
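Desired output
I would like to produce the following output in a more efficient way (in my actual problem, MatchText will contain many elements that are not known in advance).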
dt1 %>%
  filter(str_detect(Text, regex("Pen", ignore_case = TRUE))) %>%
  select(-Text) %>%
  mutate(MatchText = "Pen") %>%
  bind_rows(
    dt1 %>%
      filter(str_detect(Text, regex("Book", ignore_case = TRUE))) %>%
      select(-Text) %>%
      mutate(MatchText = "Book")
  )
# A tibble: 5 x 2
     No MatchText
  <dbl> <chr>
1     1 Pen
2     3 Pen
3     4 Pen
4     2 Book
5     4 Book
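Any hints on how to accomplish the above more efficiently would be appreciated.
str_extract_all() returns every match in each string, and you can unnest those matches into separate rows to get the desired output. If you like, you can still use the paste + collapse approach to build the pattern from the vector.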
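library(tidyverse)  # mutate() and unnest() come from dplyr and tidyr, not stringr alone
dt1 %>%
  # lower-case the text so the lower-case pattern matches regardless of case
  mutate(match = str_extract_all(tolower(Text), "pen|book")) %>%
  # one row per match; rows without any match are dropped
  unnest(match) %>%
  select(-Text)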
library(tidyverse)
dt1 %>%
  mutate(
    # "\\b" is the regex word boundary; a single "\b" in an R string is a backspace character
    result = str_extract_all(Text, regex(paste0("\\b", MatchText, "\\b", collapse = '|'), ignore_case = TRUE))
  ) %>%
  unnest(result) %>%
  select(-Text)
# # A tibble: 4 x 2
#      No result
#   <dbl> <chr>
# 1     1 pen
# 2     2 book
# 3     4 pen
# 4     4 book
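I'm not sure what happened to the "whole words" part of the question after it was edited. I left in the word boundaries to match whole words, but since "pen" is not a whole-word match for "pencile", my result does not match yours. If you want partial-word matches, remove the \b.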
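If the goal is exactly the output shown in the question (partial matches allowed, labels kept in their original case), one possible sketch is to loop over MatchText and stack the per-pattern results. This is an assumption about what is wanted rather than part of either answer; the name result is introduced here only for illustration. Each element of MatchText is treated as a regular expression, so entries containing regex metacharacters would need escaping first.

```r
library(tidyverse)

dt1 <- tibble(
  No = c(1, 2, 3, 4),
  Text = c("I have a pen.", "I have a book.", "I have a pencile.", "I have a pen and a book.")
)
MatchText <- c("Pen", "Book")

# For each pattern, keep the rows whose Text contains it (case-insensitive),
# label those rows with the pattern, then stack the pieces in pattern order.
result <- map(MatchText, function(m) {
  dt1 %>%
    filter(str_detect(Text, regex(m, ignore_case = TRUE))) %>%
    select(-Text) %>%
    mutate(MatchText = m)
}) %>%
  bind_rows()

result
```

Because each pattern is matched separately, a row that contains several patterns (row 4 here) appears once per pattern, and "pencile" still counts as a match for "Pen" since no word boundary is used.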