Matching text tokens against a list of words
I want to match the words in a word list against a text column and extract the matches into a new column.
I have this data:
df <- structure(list(ID = 1:3, Text = c(list("red car, car going, going to"), list("red ball, ball on, on street"), list("to be, be or, or not"))), class = "data.frame", row.names = c(NA, -3L))
  ID                         Text
1  1 red car, car going, going to
2  2 red ball, ball on, on street
3  3         to be, be or, or not
And I have this list of important words:
words <- c("car", "ball", "street", "dog", "frog")
I want a df like this:
  ID                         Text                        Word
1  1 red car, car going, going to              c("car","car")
2  2 red ball, ball on, on street c("ball", "ball", "street")
3  3         to be, be or, or not                          NA
My attempt:
library(stringi)  # provides the %s+% string concatenation operator
df$Word <- lapply(df$Text, function(x) stringr::str_extract_all(x, "\\b" %s+% words %s+% "\\b"))
But it gives me a list of length 5, rather than just the words that occur in the text.
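The length-5 result is consistent with `str_extract_all()` being vectorised over `pattern` as well as over `string`: five patterns applied to one string give one list element per pattern. A minimal sketch of that behaviour, assuming stringr is installed (the names `patterns` and `res` are introduced here only for illustration):

```r
library(stringr)

words <- c("car", "ball", "street", "dog", "frog")

# One pattern per word, each wrapped in word boundaries
patterns <- paste0("\\b", words, "\\b")

# A single string matched against five patterns
res <- str_extract_all("red car, car going, going to", patterns)

length(res)  # one list element per pattern, not per matched word
res[[1]]     # matches for the first pattern, "\\bcar\\b"
```

Words that do not occur in the string still get their own (empty) element, which is why the result has the length of `words` rather than the number of matches.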
A possible solution:
library(tidyverse)

df <- data.frame(
  stringsAsFactors = FALSE,
  ID = c(1L, 2L, 3L),
  Text = c("red car, car going, going to", "red ball, ball on, on street",
           "to be, be or, or not")
)

words <- c("car", "ball", "street", "dog", "frog")

df %>%
  mutate(word = Text) %>%
  # one row per token; the backslash must be doubled inside an R string
  separate_rows(word, sep = ",|\\s") %>%
  # keep only the tokens that appear in the word list
  mutate(word = ifelse(word %in% words, word, NA)) %>%
  drop_na(word) %>%
  group_by(ID) %>%
  summarise(word = str_c(word, collapse = ", "), .groups = "drop") %>%
  # join back to df so that rows without any match are kept, with NA
  left_join(df, ., by = c("ID"))
#>   ID                         Text               word
#> 1  1 red car, car going, going to           car, car
#> 2  2 red ball, ball on, on street ball, ball, street
#> 3  3         to be, be or, or not               <NA>
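If a list column like the one sketched in the question is preferred over a comma-separated string, one option is to collapse `words` into a single alternation pattern so that each row yields exactly one character vector. The following is only a sketch, assuming stringr is installed; the name `pattern` is introduced here for illustration and is not part of the answer above.

```r
library(stringr)

df <- data.frame(
  ID = 1:3,
  Text = c("red car, car going, going to", "red ball, ball on, on street",
           "to be, be or, or not"),
  stringsAsFactors = FALSE
)
words <- c("car", "ball", "street", "dog", "frog")

# A single regex: \b(car|ball|street|dog|frog)\b
pattern <- paste0("\\b(", paste(words, collapse = "|"), ")\\b")

# One character vector of matches per row, stored as a list column
df$Word <- str_extract_all(df$Text, pattern)

# Rows without any match get NA instead of character(0)
df$Word[lengths(df$Word) == 0] <- list(NA_character_)
```

Because there is only one pattern, the result has one element per row of `df` rather than one per word. Note that this matches whole words only; the words are used as regex fragments, so any special characters in them would need escaping first.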