Detect part of a string in R (not exact match)

Question

Consider the following dataset:

a <- c("my house", "green", "the cat is", "a girl")
b <- c("my beautiful house is cool", "the apple is green", "I m looking at the cat that is sleeping", "a boy")
c <- c("T", "T", "T", "F")
df <- data.frame(string1=a, string2=b, returns=c)

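I am trying to detect string1 within string2, but my goal is not only to detect exact matches. I am looking for a way to check whether the words of string1 are present in string2, regardless of the order in which they appear. For example, the string "my beautiful house is cool" should return TRUE when searching for "my house".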

I have tried to illustrate the expected behavior of the script in the "returns" column of the example dataset above.

I tried the grepl() and str_detect() functions, but they only work for exact matches. Can you help? Thanks in advance.

Answer 1

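The trick here is not to use str_detect as is, but to first split search_words into individual words. This is done with strsplit() below. We then pass the result to str_detect to check whether all of the words match.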

library(stringr)
search_words <- c("my house", "green", "the cat is", "a girl")
words <- c("my beautiful house is cool", "the apple is green", "I m looking at the cat that is sleeping", "a boy")

# Split each search phrase into its individual words
patterns <- strsplit(search_words, " ")

# For each string, check that every word of the corresponding phrase is detected
mapply(function(string, pattern) all(str_detect(string, pattern)), words, patterns)
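
Note that str_detect matches substrings, so a search word such as "a" would also be found inside "cat". If whole-word matching is wanted, a minimal base R sketch of the same split-then-check idea might look like the following. The helper name all_words_present is hypothetical, and the sketch assumes words are separated by single spaces.

```r
# Hypothetical helper: TRUE when every word of `needle` occurs as a
# whole word in the corresponding element of `haystack`.
# Assumes words are separated by single spaces.
all_words_present <- function(needle, haystack) {
  needle_words <- strsplit(needle, " ", fixed = TRUE)
  haystack_words <- strsplit(haystack, " ", fixed = TRUE)
  # Compare the two word lists element by element
  mapply(function(n, h) all(n %in% h),
         needle_words, haystack_words,
         USE.NAMES = FALSE)
}

a <- c("my house", "green", "the cat is", "a girl")
b <- c("my beautiful house is cool", "the apple is green",
       "I m looking at the cat that is sleeping", "a boy")

all_words_present(a, b)
```

On the example data this agrees with the "returns" column of the question (T, T, T, F).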

Answer 2

A base R option that does not involve splitting could be:

n_words <- lengths(regmatches(df[, 1], gregexpr(" ", df[, 1], fixed = TRUE))) + 1

n_matches <- mapply(FUN = function(x, y) lengths(regmatches(x, gregexpr(y, x))), 
                    df[, 2],
                    gsub(" ", "|", df[, 1], fixed = TRUE),
                    USE.NAMES = FALSE)

n_matches == n_words

[1]  TRUE  TRUE  TRUE FALSE

However, it assumes that every row of string1 contains at least one word.
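
One way to make that assumption explicit is to skip rows where string1 is empty. The sketch below wraps the same counting approach in a hypothetical helper, count_match, which returns NA for such rows. It is an illustration under the assumption that words are separated by single spaces, not part of the original answer.

```r
# Hypothetical wrapper around the counting approach: rows whose string1
# is empty or only whitespace yield NA instead of being counted.
count_match <- function(s1, s2) {
  result <- rep(NA, length(s1))
  ok <- nzchar(trimws(s1))          # rows that contain at least one word
  if (!any(ok)) return(result)
  # Number of words = number of spaces + 1
  n_words <- lengths(regmatches(s1[ok], gregexpr(" ", s1[ok], fixed = TRUE))) + 1
  # Number of times any of the words (joined with "|") matches in string2
  n_matches <- mapply(function(x, y) lengths(regmatches(x, gregexpr(y, x))),
                      s2[ok],
                      gsub(" ", "|", s1[ok], fixed = TRUE),
                      USE.NAMES = FALSE)
  result[ok] <- n_matches == n_words
  result
}

count_match(c("my house", "", "a girl"),
            c("my beautiful house is cool", "anything", "a boy"))
```

Because the comparison relies on match counts, a word of string1 that occurs more than once in string2 would also make the counts differ.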
