How to keep specific group of words or phrases of a text column in R?

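I have a data frame with a text column, and I would like to create another column that contains only the specific words or phrases found in that text column. Suppose I have these 4 rows in my data frame: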

   TEXT_COLUMN
1 "discovering the hidden themes in the collection."
2 "classifying the documents into the discovered themes."
3 "using the classification to organize/summarize/search the documents."
4 "alternatively, we can set a threshold on the score"

On the other hand, I have a list of words and phrases that I would like to keep. For example:

x <- c("hidden themes", "the documents", "discovered themes", "classification to organize", "search")

So, I would like to create a new column "KEYWORDS" containing the entries of "x" that match the text column, separated by commas:

   TEXT_COLUMN                                                             |  KEYWORDS
1 "discovering the hidden themes in the collection."                       |  "hidden themes"
2 "classifying the documents into the discovered themes."                  |  "the documents", "discovered themes"
3 "using the classification to organize/summarize/search the documents."   |  "classification to organize", "search"
4 "alternatively, we can set a threshold on the score"                     |  NA

Do you know how to do this?

Thank you very much.

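One option is to create a single regex pattern from 'x' by joining its elements with str_c, wrapped in word boundaries so that only whole words and phrases match:
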
library(stringr)
library(dplyr)
# Backslashes must be doubled in R strings: "\\b" is the regex word boundary
pat <- str_c("\\b(", str_c(x, collapse = "|"), ")\\b")
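
To see what the pattern construction above produces, here is a minimal sketch that prints the pattern and checks the effect of the word boundaries. It assumes the keyword vector `x` from the question; the two test strings are made up for illustration. Note that in an R string literal a single `\b` is the backspace character, so the regex word boundary has to be written as `\\b`.

```r
library(stringr)

x <- c("hidden themes", "the documents", "discovered themes",
       "classification to organize", "search")

# Join the keywords with "|" and wrap the alternation in word boundaries
pat <- str_c("\\b(", str_c(x, collapse = "|"), ")\\b")

# writeLines() shows the pattern as the regex engine sees it
writeLines(pat)
# \b(hidden themes|the documents|discovered themes|classification to organize|search)\b

# Illustrative strings (not from the question): "search" matches as a whole
# word, but not when it is only part of a longer word such as "research"
hits <- str_detect(c("search the docs", "research notes"), pat)
print(hits)
```

If the keywords could contain regex metacharacters such as `.` or `(`, they would need escaping before being joined; the keywords in this question do not.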

Then, using this pattern, extract the matching substrings from 'TEXT_COLUMN' into a list column of character vectors:

df1 <- df1 %>% 
      mutate(KEYWORDS = str_extract_all(TEXT_COLUMN, pat))

Output:

df1
#TEXT_COLUMN                                          KEYWORDS
#1                     discovering the hidden themes in the collection.                                     hidden themes
#2                classifying the documents into the discovered themes.                  the documents, discovered themes
#3 using the classification to organize/summarize/search the documents. classification to organize, search, the documents
#4                   alternatively, we can set a threshold on the score                                                  
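
The result above is a list column, and a row without matches holds an empty vector rather than NA. If a single comma-separated string per row is wanted, as in the question's expected output, one possible follow-up is to collapse each vector and substitute NA for empty ones. This is only a sketch: the helper name `collapse_matches` and the result name `df2` are hypothetical, and the data frame is rebuilt here so the block runs on its own.

```r
library(stringr)
library(dplyr)

x <- c("hidden themes", "the documents", "discovered themes",
       "classification to organize", "search")

df1 <- data.frame(
  TEXT_COLUMN = c(
    "discovering the hidden themes in the collection.",
    "classifying the documents into the discovered themes.",
    "using the classification to organize/summarize/search the documents.",
    "alternatively, we can set a threshold on the score"
  ),
  stringsAsFactors = FALSE
)

pat <- str_c("\\b(", str_c(x, collapse = "|"), ")\\b")

# Hypothetical helper: join the matches of one row with ", ",
# or return NA when nothing matched
collapse_matches <- function(m) {
  if (length(m) > 0) paste(m, collapse = ", ") else NA_character_
}

df2 <- df1 %>%
  mutate(KEYWORDS = vapply(str_extract_all(TEXT_COLUMN, pat),
                           collapse_matches, character(1)))

print(df2$KEYWORDS)
```

Row 3 also yields "the documents", because that phrase does occur in the sentence even though the question's expected output omits it.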

Data:

df1 <- structure(list(TEXT_COLUMN = c("discovering the hidden themes in the collection.", 
"classifying the documents into the discovered themes.", "using the classification to organize/summarize/search the documents.", 
"alternatively, we can set a threshold on the score")), 
class = "data.frame", row.names = c("1", 
"2", "3", "4"))