How to extract all instances of a vector of strings from text based on conditions

Question

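I'm fairly new to R. I'm trying to extract some strings from text (a column in a data frame) and store them alongside their names (another column in the same data frame), subject to the conditions below.

A simplified example of what I'm trying to do: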

textdf <- data.frame(names = letters[1:4], text = c("I'm trying to extract flowers from text", 
                                                "there are certain conditions on how to extract", 
                                                "this red rose is also nice-smelling", 
                                                "scarlet rose is also fine"))

extractdf <- data.frame(extractions = c("extract", "certain", "certain conditions", 
                                        "nice-smelling rose", "red rose"), 
                        synonyms = c(NA, NA, NA, NA, "scarlet rose"))

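I want to:

1. Go through the "extractions" column and extract every instance that occurs in the "text" column of my df.
2. If a row has no match, say there is no match for "red rose", look for the synonym instead, here "scarlet rose".
3. For phrases that share the same FIRST word, extract the longest substring. For example, if I have both "certain" and "certain conditions", I want to keep "certain conditions".
4. Also extract "nice-smelling rose"?
5. Finally, store all the extractions in a separate column of the df; a named list would also be fine.

So what I need is this: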

#result
textdf <- data.frame(names = letters[1:4], text = c("I'm trying to extract flowers from text", 
                                                "there are certain conditions on how to extract", 
                                                "this red rose is also nice-smelling", 
                                                "scarlet rose is also fine"), 
                     ex = c("extract", "certain conditions, extract", "nice-smelling rose, red rose", "scarlet rose"))

I tried:

##for the first item
library(rebus)
library(stringi)
sapply(textdf$text, function(x) stri_extract_all_regex(x, or1(extractdf$extractions)))

This finds "certain" but not "certain conditions".

##for the second and fourth item
library(stringdist)
Match_Idx = amatch(textdf$text, extractdf$extractions, method = 'lcs', maxDist = Inf)
Matches = data.frame(textdf$text, extractdf$extractions[Match_Idx])

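This is nice because it extracts "certain conditions" and "nice-smelling rose", but the problem is: what if a text contains both "certain conditions" and "nice-smelling rose"? How can I make it find both?

I have no idea what to do for the third item. Do I have to tokenize the text and the extractions, find the unique first words, and then extract the longest match?

I'd appreciate help with any of these points, or help solving all of them in one custom function, so that I end up with everything I extracted.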

Answer 1

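You can put regular expressions into a vector (note that every regex backslash has to be doubled inside an R string),

rex <- c("(extract)", "((?>(?>red)|(?>scarlet))\\srose)", 
         "(\\bcertain\\sconditions\\b)", 
         "((?>rose).*(?>nice-smelling)|(?>nice-smelling).*(?>rose))")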

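As a quick check of why the word boundaries and doubled backslashes matter, here is a minimal sketch using the third pattern on two made-up sentences (the sentences are my own illustration, not part of the question's data):

```r
# "\\b" in an R string is the regex \b (word boundary); a single "\b" would be a backspace
pat <- "(\\bcertain\\sconditions\\b)"

# Hypothetical test sentences
hit  <- grepl(pat, "there are certain conditions", perl = TRUE)
miss <- grepl(pat, "under uncertain conditions", perl = TRUE)

hit   # the phrase stands alone, so the pattern matches
miss  # "uncertain" has no word boundary before "certain", so no match
```
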
create a matching function

fun <- function(x, y) regmatches(x, regexpr(y, x, perl=TRUE))

and apply outer.

M <- outer(textdf$text, rex, Vectorize(fun))
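To see what shape `outer` returns here, the following small sketch runs the same idea on toy inputs (`texts` and `pats` are hypothetical stand-ins for `textdf$text` and `rex`). Because `regmatches` returns `character(0)` when nothing matches, the result is a list matrix rather than a plain character matrix whenever at least one pair fails to match:

```r
# Toy inputs (hypothetical, not the question's data)
texts <- c("a red rose", "no flowers here")
pats  <- c("(rose)", "(red|scarlet)")

# Same matching function as above: first match, or character(0) if none
fun <- function(x, y) regmatches(x, regexpr(y, x, perl = TRUE))

# Every text is tried against every pattern
M <- outer(texts, pats, Vectorize(fun))

dim(M)     # one row per text, one column per pattern
M[[1, 2]]  # match of the second pattern in the first text
M[[2, 1]]  # empty: the second text contains no "rose"
```

This is why the later clean-up and `toString(unlist(x))` steps operate on list cells: empty cells simply drop out when a row is unlisted.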

Now we should clean up the matrix a bit, which depends somewhat on your data, for example

M[grep("((?>rose).*(?>nice-smelling)|(?>nice-smelling).*(?>rose))", 
       M, perl=TRUE)] <- "nice-smelling rose"

Finally, collapse the resulting matrix and add the new vector to the data frame.

textdf$ex <- apply(M, 1, function(x) toString(unlist(x)))

which gives

textdf
#   names                                           text                           ex
# 1     a        I'm trying to extract flowers from text                      extract
# 2     b there are certain conditions on how to extract  extract, certain conditions
# 3     c            this red rose is also nice-smelling red rose, nice-smelling rose
# 4     d                      scarlet rose is also fine                 scarlet rose
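
If you would rather not hand-write one regex per phrase, the "longest phrase wins" rule can also be sketched by building a single alternation from the phrases, sorted longest first, since PCRE tries alternatives from left to right. This is only a sketch under my own assumptions: the `phrases` vector is typed in by hand, synonyms are simply treated as extra phrases rather than as a fallback, and it does not cover the out-of-order "nice-smelling rose" case, which still needs a dedicated pattern like the one above.

```r
texts <- c("I'm trying to extract flowers from text",
           "there are certain conditions on how to extract",
           "this red rose is also nice-smelling",
           "scarlet rose is also fine")

# Assumption: synonyms are added as ordinary phrases
phrases <- c("extract", "certain", "certain conditions", "red rose", "scarlet rose")

# Longest phrases first, so "certain conditions" is tried before "certain"
ord <- order(nchar(phrases), decreasing = TRUE)

# \Q...\E makes PCRE treat each phrase literally; \b keeps matches on word boundaries
pattern <- paste0("\\b(?:",
                  paste0("\\Q", phrases[ord], "\\E", collapse = "|"),
                  ")\\b")

# All matches per text, collapsed into one comma-separated string
hits <- regmatches(texts, gregexpr(pattern, texts, perl = TRUE))
ex <- vapply(hits, toString, character(1))
ex
```
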

Tags: regex, string, text-extraction, r