How to keep specific words or phrases from a text column in R?
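I have a data frame with a text column, and I want to create another column that contains only the specific words or phrases that match the text column.
Suppose I have these 4 rows in the data frame: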
TEXT_COLUMN
1 "discovering the hidden themes in the collection."
2 "classifying the documents into the discovered themes."
3 "using the classification to organize/summarize/search the documents."
4 "alternatively, we can set a threshold on the score"
Separately, I have a list of words and phrases that I want to keep. For example:
x <- c("hidden themes", "the documents", "discovered themes", "classification to organize", "search")
So I want to create a new column "KEYWORDS" containing the words and phrases from "x" that match the text column, separated by commas:
TEXT_COLUMN | KEYWORDS
1 "discovering the hidden themes in the collection." | "hidden themes"
2 "classifying the documents into the discovered themes." | "the documents", "discovered themes"
3 "using the classification to organize/summarize/search the documents." | "classification to organize", "search"
4 "alternatively, we can set a threshold on the score" | NA
Do you know how to do this?
Thank you very much.
One option is to create a pattern from 'x' by joining its elements with str_c
library(stringr)
library(dplyr)
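# Word boundaries (\\b) around an alternation of all keywords;
# the backslash must be doubled inside an R string literal
pat <- str_c("\\b(", str_c(x, collapse = "|"), ")\\b")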
Then, using this pattern, extract the matching substrings from 'TEXT_COLUMN' into a list column of vectors
df1 <- df1 %>%
mutate(KEYWORDS = str_extract_all(TEXT_COLUMN, pat))
-Output
df1
#TEXT_COLUMN KEYWORDS
#1 discovering the hidden themes in the collection. hidden themes
#2 classifying the documents into the discovered themes. the documents, discovered themes
#3 using the classification to organize/summarize/search the documents. classification to organize, search, the documents
#4 alternatively, we can set a threshold on the score
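The question asks for KEYWORDS as a comma-separated string with NA where nothing matches, while str_extract_all returns a list column (empty for row 4). A minimal sketch of one way to bridge that gap, assuming base R's vapply and paste are acceptable here; the data frame is rebuilt inside the block only so that it runs on its own.

```r
library(stringr)

x <- c("hidden themes", "the documents", "discovered themes",
       "classification to organize", "search")

df1 <- data.frame(
  TEXT_COLUMN = c(
    "discovering the hidden themes in the collection.",
    "classifying the documents into the discovered themes.",
    "using the classification to organize/summarize/search the documents.",
    "alternatively, we can set a threshold on the score"
  ),
  stringsAsFactors = FALSE
)

# Same pattern as in the answer: keywords joined by "|" inside word boundaries
pat <- str_c("\\b(", str_c(x, collapse = "|"), ")\\b")

# One character vector of matches per row (length 0 when nothing matches)
matches <- str_extract_all(df1$TEXT_COLUMN, pat)

# Collapse each vector into a single string; rows without matches become NA
df1$KEYWORDS <- vapply(
  matches,
  function(m) if (length(m) == 0) NA_character_ else paste(m, collapse = ", "),
  character(1)
)
```

With this, KEYWORDS is an ordinary character column, which may be easier to print or export than a list column.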
Data
df1 <- structure(list(TEXT_COLUMN = c("discovering the hidden themes in the collection.",
"classifying the documents into the discovered themes.", "using the classification to organize/summarize/search the documents.",
"alternatively, we can set a threshold on the score")),
class = "data.frame", row.names = c("1",
"2", "3", "4"))
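If adding stringr and dplyr is not desirable, the same extraction can probably be done with base R alone. The following is a sketch under that assumption, using gregexpr and regmatches with perl = TRUE; it is an alternative to the approach above, not part of the original answer, and the name `kw` is purely illustrative.

```r
x <- c("hidden themes", "the documents", "discovered themes",
       "classification to organize", "search")

df1 <- data.frame(
  TEXT_COLUMN = c(
    "discovering the hidden themes in the collection.",
    "classifying the documents into the discovered themes.",
    "using the classification to organize/summarize/search the documents.",
    "alternatively, we can set a threshold on the score"
  ),
  stringsAsFactors = FALSE
)

# Build the alternation pattern with base paste0/paste
pat <- paste0("\\b(", paste(x, collapse = "|"), ")\\b")

# gregexpr finds every match position; regmatches pulls out the matched text.
# The result is a list with one character vector per row.
kw <- regmatches(df1$TEXT_COLUMN, gregexpr(pat, df1$TEXT_COLUMN, perl = TRUE))

# Store the list as a list column, mirroring the str_extract_all result
df1$KEYWORDS <- kw
```

Rows with no match get a zero-length character vector, just as with str_extract_all.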