How to keep specific groups of words or phrases from a text column in R?

Question

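I have a data frame with a text column, and I want to create another column that contains only the specific words or phrases found in that text column. Suppose my data frame has these 4 rows: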

   TEXT_COLUMN
1 "discovering the hidden themes in the collection."
2 "classifying the documents into the discovered themes."
3 "using the classification to organize/summarize/search the documents."
4 "alternatively, we can set a threshold on the score"

On the other hand, I have a list of words and phrases that I want to keep. For example:

x <- c("hidden themes", "the documents", "discovered themes", "classification to organize", "search")

So I want to create a new column, "KEYWORDS", containing the words and phrases from "x" that occur in the text column, separated by commas:

   TEXT_COLUMN                                                             |  KEYWORDS
1 "discovering the hidden themes in the collection."                       |  "hidden themes"
2 "classifying the documents into the discovered themes."                  |  "the documents", "discovered themes"
3 "using the classification to organize/summarize/search the documents."   |  "classification to organize", "search"
4 "alternatively, we can set a threshold on the score"                     |  NA

Do you know how to do this?

Thank you very much.

Answer 1

One option is to create a pattern from 'x' by joining its elements with str_c

library(stringr)
library(dplyr)
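# The backslash must be doubled inside an R string: "\b" is a backspace
# character, while "\\b" reaches the regex engine as a word boundary.
pat <- str_c("\\b(", str_c(x, collapse="|"), ")\\b")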
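To see what the joined pattern does, here is a minimal sketch. It uses base R's paste0 instead of str_c purely so that it runs without extra packages; that substitution is an assumption of this example, not part of the answer.

```r
x <- c("hidden themes", "the documents", "discovered themes",
       "classification to organize", "search")

# Join the phrases with "|" and wrap the alternation in word boundaries
pat <- paste0("\\b(", paste(x, collapse = "|"), ")\\b")

# cat() shows the pattern as the regex engine receives it
cat(pat, "\n")

# The word boundaries keep "search" from matching inside "researching"
hits <- grepl(pat, c("search the documents", "researching"), perl = TRUE)
print(hits)  # TRUE FALSE
```

Without the surrounding \\b, the second string would also match, because "search" occurs inside "researching".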

Then, using this pattern, extract the substrings from 'TEXT_COLUMN' into a list column of vectors

df1 <- df1 %>% 
      mutate(KEYWORDS = str_extract_all(TEXT_COLUMN, pat))

-output

df1
#TEXT_COLUMN                                          KEYWORDS
#1                     discovering the hidden themes in the collection.                                     hidden themes
#2                classifying the documents into the discovered themes.                  the documents, discovered themes
#3 using the classification to organize/summarize/search the documents. classification to organize, search, the documents
#4                   alternatively, we can set a threshold on the score
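The question asks for a single comma-separated string per row, with NA where nothing matches, whereas str_extract_all returns a list column. One way to bridge that gap might look like the sketch below. The helper name collapse_or_na is hypothetical and introduced only for illustration, and the data frame is rebuilt inline so the example is self-contained.

```r
library(stringr)

df1 <- data.frame(TEXT_COLUMN = c(
  "discovering the hidden themes in the collection.",
  "classifying the documents into the discovered themes.",
  "using the classification to organize/summarize/search the documents.",
  "alternatively, we can set a threshold on the score"
))
x <- c("hidden themes", "the documents", "discovered themes",
       "classification to organize", "search")
pat <- str_c("\\b(", str_c(x, collapse = "|"), ")\\b")

# Hypothetical helper: join one row's matches, or return NA if there are none
collapse_or_na <- function(m) {
  if (length(m) == 0) NA_character_ else str_c(m, collapse = ", ")
}

# str_extract_all gives one character vector per row; vapply flattens each
# of them to a single string, so KEYWORDS becomes a plain character column
matches <- str_extract_all(df1$TEXT_COLUMN, pat)
df1$KEYWORDS <- vapply(matches, collapse_or_na, character(1))
```

As in the output above, row 3 also picks up "the documents", because that phrase is in 'x' and occurs in the text, even though the expected output in the question leaves it out.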

data

df1 <- structure(list(TEXT_COLUMN = c("discovering the hidden themes in the collection.", 
"classifying the documents into the discovered themes.", "using the classification to organize/summarize/search the documents.", 
"alternatively, we can set a threshold on the score")), 
class = "data.frame", row.names = c("1", 
"2", "3", "4"))

r

text-mining
