Finding a list of words present in a column of a data frame using grepl in R
I have a data frame df:
df <- structure(list(page = c(12, 6, 9, 65),
text = structure(c(4L,2L, 1L, 3L),
.Label = c("I just bought a brand new AudiA6", "Get 2 years engine replacement warranty on BMW X6",
"Volkswagen is the parent company of BMW", "ToyotaCorolla is offering new car exchange offers"),
class = "factor")), .Names = c("page","text"), row.names = c(NA, -4L), class = "data.frame")
I also have a list of words:
wordlist <- c("Audi", "BMW", "extended", "engine", "replacement", "Volkswagen", "company", "Toyota","exchange", "brand")
I check which words from wordlist appear in the text column by running grepl for each word and unlisting the result:
library(data.table)
setDT(df)[, match := paste(wordlist[unlist(lapply(wordlist, function(x) grepl(x, text, ignore.case = T)))], collapse = ","), by = 1:nrow(df)]
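The problem is that I only want the words from wordlist that appear in the text column as exact, whole words.
grepl also returns partial matches: for example, AudiA6 in the text matches the word Audi from wordlist. In addition, my data frame is very large, and this grepl approach takes a long time to run. Please recommend another approach if possible. I would like something like this: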
df <- structure(list(page = c(12, 6, 9, 65),
text = structure(c(4L,2L, 1L, 3L),
.Label = c("I just bought a brand new AudiA6", "Get 2 years engine replacement warranty on BMW X6",
"Volkswagen is the parent company of BMW", "ToyotaCorolla is offering new car exchange offers"),
class = "factor"), match = c("exchange", "BMW,engine,replacement",
"brand", "BMW,Volkswagen,company")), row.names = c(NA, -4L),
class = c("data.table", "data.frame"))
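You can use str_extract_all from stringr after adding a word boundary (\b, written "\\b" inside an R string) around each word you want to extract, so that only whole-word matches are considered. You also need to collapse all the words with "|" to mean "or":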
sapply(stringr::str_extract_all(df$text, paste("\\b", wordlist, "\\b", sep="", collapse="|")), paste, collapse=",")
# [1] "exchange" "engine,replacement,BMW" "brand" "Volkswagen,company,BMW"
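To see what the word boundary changes, here is a small base R check. It is only a sketch: the second string is made up for illustration, and perl = TRUE is my choice so that \b is handled by the PCRE engine.

```r
# Two test strings: one where "Audi" is glued to "A6", one where it stands alone.
x <- c("I just bought a brand new AudiA6", "I just bought an Audi")

# Plain substring pattern: matches inside "AudiA6" as well.
partial <- grepl("Audi", x, perl = TRUE)

# Word-boundary pattern: "Audi" must not be followed or preceded by a word character.
whole <- grepl("\\bAudi\\b", x, perl = TRUE)

print(partial)  # TRUE TRUE
print(whole)    # FALSE TRUE
```

Note the doubled backslash: a single "\b" in an R string literal is the backspace character, not the regex word boundary.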
If you want to put it in your data.table:
df[, match:=sapply(stringr::str_extract_all(text, paste("\\b", wordlist, "\\b", sep="", collapse="|")), paste, collapse=",")]
df
# page text match
#1: 12 ToyotaCorolla is offering new car exchange offers exchange
#2: 6 Get 2 years engine replacement warranty on BMW X6 engine,replacement,BMW
#3: 9 I just bought a brand new AudiA6 brand
#4: 65 Volkswagen is the parent company of BMW Volkswagen,company,BMW
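The stringr result lists matches in the order they occur in the text, while the desired output in the question lists them in wordlist order. If that order matters, one possible base R variant is sketched below. It assumes the words contain no regex metacharacters, and the names hits and match_col are made up for this example. It makes one vectorized grepl call per word rather than one per row.

```r
# Same texts and word list as in the question, as a plain character vector.
text <- c("ToyotaCorolla is offering new car exchange offers",
          "Get 2 years engine replacement warranty on BMW X6",
          "I just bought a brand new AudiA6",
          "Volkswagen is the parent company of BMW")
wordlist <- c("Audi", "BMW", "extended", "engine", "replacement",
              "Volkswagen", "company", "Toyota", "exchange", "brand")

# Logical matrix: one row per text, one column per word.
# A cell is TRUE when the word occurs as a whole word (case-insensitive).
hits <- sapply(wordlist, function(w) {
  grepl(paste0("\\b", w, "\\b"), text, ignore.case = TRUE, perl = TRUE)
})

# For each row, keep the matching words in wordlist order and collapse them.
match_col <- apply(hits, 1, function(row) paste(wordlist[row], collapse = ","))

print(match_col)
# "exchange" "BMW,engine,replacement" "brand" "BMW,Volkswagen,company"
```

With a data.table, the result could then be assigned as a column, for example df[, match := match_col].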