Detect part of a string in R (not exact match)
Consider the following dataset:
a <- c("my house", "green", "the cat is", "a girl")
b <- c("my beautiful house is cool", "the apple is green", "I m looking at the cat that is sleeping", "a boy")
c <- c("T", "T", "T", "F")
df <- data.frame(string1=a, string2=b, returns=c)
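I am trying to detect string1 within string2, but my goal is not only to detect exact matches. I am looking for a way to detect whether the words of string1 are present in string2, regardless of the order in which they appear. For example, the string "my beautiful house is cool" should return TRUE when searching for "my house".
I have tried to illustrate the expected behavior of the script in the "returns" column of the sample dataset above.
I tried the grepl() and str_detect() functions, but they only work for exact matches. Can you help? Thanks in advance.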
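The trick here is not to use str_detect as is, but to first split search_words into individual words. This is done with strsplit() below. We then pass the result to str_detect to check that all of the words are matched.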
library(stringr)
search_words <- c("my house", "green", "the cat is", "a girl")
words <- c("my beautiful house is cool", "the apple is green", "I m looking at the cat that is sleeping", "a boy")
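# Split each search phrase into its individual words (one character vector per phrase)
patterns <- strsplit(search_words, " ")
# For each sentence, check that every word of the corresponding phrase is found
mapply(function(string, pattern) all(str_detect(string, pattern)), words, patterns)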
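One caveat with the approach above: str_detect() matches substrings, so a search word such as "is" would also be found inside "this". If whole-word matching is wanted, something along these lines might work. It is only a sketch: contains_all_words is a hypothetical helper name, it uses base R's grepl() with \b word boundaries instead of stringr, and it assumes the search words contain no regex metacharacters.

```r
# Hypothetical helper: TRUE when every word of `needle` occurs in `haystack`
# as a whole word, in any order.
contains_all_words <- function(needle, haystack) {
  words <- strsplit(needle, " ", fixed = TRUE)[[1]]
  # Wrap each word in \b so that "is" does not match inside "this".
  # Assumption: the words contain no regex metacharacters.
  hits <- vapply(
    words,
    function(w) grepl(paste0("\\b", w, "\\b"), haystack, perl = TRUE),
    logical(1)
  )
  all(hits)
}

df <- data.frame(
  string1 = c("my house", "green", "the cat is", "a girl"),
  string2 = c("my beautiful house is cool", "the apple is green",
              "I m looking at the cat that is sleeping", "a boy"),
  stringsAsFactors = FALSE
)

# Apply the helper row by row
detected <- mapply(contains_all_words, df$string1, df$string2, USE.NAMES = FALSE)
detected
# [1]  TRUE  TRUE  TRUE FALSE
```

With this variant, contains_all_words("is", "this") gives FALSE, whereas plain substring matching would report a match.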
A base R option that does not involve splitting could be:
n_words <- lengths(regmatches(df[, 1], gregexpr(" ", df[, 1], fixed = TRUE))) + 1
n_matches <- mapply(FUN = function(x, y) lengths(regmatches(x, gregexpr(y, x))),
                    df[, 2],
                    gsub(" ", "|", df[, 1], fixed = TRUE),
                    USE.NAMES = FALSE)
n_matches == n_words
[1] TRUE TRUE TRUE FALSE
However, it assumes that every row of string1 contains at least one word.
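Comparing raw match counts also seems to be sensitive to repeated words: searching for "my house" in "my my" yields two matches for two words, which would be reported as TRUE. A possible way around both issues is sketched below. all_words_found is a hypothetical helper name, returning NA for an empty or missing search string is an assumption about the desired behavior, and the sketch assumes the words in string1 are distinct and separated by single spaces.

```r
# Hypothetical helper: count *distinct* matched words rather than raw matches.
all_words_found <- function(needle, haystack) {
  # Assumption: an empty or missing search string yields NA, not TRUE/FALSE
  if (is.na(needle) || !nzchar(needle)) return(NA)
  # Number of words = number of single spaces + 1
  n_words <- lengths(regmatches(needle, gregexpr(" ", needle, fixed = TRUE))) + 1
  # Turn "my house" into the alternation "my|house" and collect all matches
  pattern <- gsub(" ", "|", needle, fixed = TRUE)
  found <- regmatches(haystack, gregexpr(pattern, haystack))[[1]]
  # A repeated word in `haystack` is only counted once
  length(unique(found)) == n_words
}

all_words_found("my house", "my beautiful house is cool")  # TRUE
all_words_found("my house", "my my")                       # FALSE
all_words_found("", "a boy")                               # NA
```

Like the original, this still matches substrings rather than whole words.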