R's grepl(): checking that multiple strings are all present

Question

grepl("instance|percentage", labelTest$Text)

This returns TRUE if either instance or percentage is present.

How can I get TRUE only when both terms are present?

Answer 1

Text <- c("instance", "percentage", "n", 
          "instance percentage", "percentage instance")

grepl("instance|percentage", Text)
# TRUE  TRUE FALSE  TRUE  TRUE

grepl("instance.*percentage|percentage.*instance", Text)
# FALSE FALSE FALSE TRUE  TRUE

The latter works by looking for:

('instance')(any character sequence)('percentage')  
OR  
('percentage')(any character sequence)('instance')

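Of course, if you need to find any combination of more than two words, this gets pretty complicated. In that case the solution mentioned in the comments (one grepl() call per word, combined with &) is easier to implement and read.

Another alternative that may be relevant when matching many words is positive look-ahead, which can be thought of as a 'non-consuming' match. For this you have to enable Perl-compatible regular expressions with perl=TRUE.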

# create a vector of word combinations
set.seed(1)
words <- c("instance", "percentage", "element",
           "character", "n", "o", "p")
Text2 <- replicate(10, paste(sample(words, 5), collapse=" "))

# grepl with multiple positive look-ahead
longperl <- grepl("(?=.*instance)(?=.*percentage)(?=.*element)(?=.*character)",
  Text2, perl=TRUE)

# this is equivalent to the solution proposed in the comments
longstrd <- grepl("instance", Text2) & 
          grepl("percentage", Text2) & 
             grepl("element", Text2) & 
           grepl("character", Text2)

# they produce identical results
identical(longperl, longstrd)

Furthermore, if you have the patterns stored in a vector, you can condense the expressions significantly:

pat <- c("instance", "percentage", "element", "character")

longperl <- grepl(paste0("(?=.*", pat, ")", collapse=""), Text2, perl=TRUE)
longstrd <- rowSums(sapply(pat, grepl, Text2) - 1L) == 0L
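
The rowSums() version above relies on TRUE - 1L being 0, which is compact but not obvious at first glance. A minimal sketch of the same idea using Reduce() follows; the vector x is made-up illustration data, and fixed=TRUE is an assumption that the patterns are literal strings rather than regular expressions.

```r
# Hypothetical example data; `x` is not part of the original answer.
x <- c("instance percentage element character",
       "instance percentage",
       "character element percentage instance n")
pat <- c("instance", "percentage", "element", "character")

# One logical vector per pattern (fixed = TRUE treats each pattern literally).
hits <- lapply(pat, grepl, x = x, fixed = TRUE)

# Combine the vectors element-wise: TRUE only where every pattern matched.
all_present <- Reduce(`&`, hits)
all_present
```

Drop fixed=TRUE if the entries of pat are meant to be regular expressions.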

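As requested in the comments, if you want to match exact words, i.e. not substrings, you can specify word boundaries with \b (written "\\b" inside an R string). For example: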

tx <- c("cent element", "percentage element", "element cent", "element centimetre")

grepl("(?=.*\\bcent\\b)(?=.*element)", tx, perl=TRUE)
# TRUE FALSE  TRUE FALSE
grepl("element", tx) & grepl("\\bcent\\b", tx)
# TRUE FALSE  TRUE FALSE
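
The word-boundary idea can probably be combined with the pattern-vector trick shown earlier. The sketch below assumes every entry of a hypothetical vector pat should match as a whole word and contains no regex metacharacters (otherwise they would need escaping first).

```r
tx <- c("cent element", "percentage element", "element cent", "element centimetre")
pat <- c("cent", "element")  # illustrative pattern vector

# Wrap each word in word boundaries inside its own look-ahead, then concatenate.
regex <- paste0("(?=.*\\b", pat, "\\b)", collapse = "")
regex

# Look-ahead needs the Perl-compatible engine.
whole_word <- grepl(regex, tx, perl = TRUE)
whole_word
```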

Answer 2

Use intersect() and supply one grep() call per word:

# base R indexing is sufficient for subsetting the text vector below

vector_of_text[
  intersect(
    grep(vector_of_text , pattern = "pattern1"),
    grep(vector_of_text , pattern = "pattern2")
  )
]
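
For a concrete picture of what the snippet above returns, here is a small sketch with made-up data. Note that grep() yields integer positions rather than a logical vector, so the last line shows one possible way to recover a grepl()-style result; the variable names are illustrative only.

```r
# Hypothetical example vector.
vector_of_text <- c("instance", "percentage", "instance percentage", "n")

# Positions matched by each pattern, then the positions common to both.
idx <- intersect(
  grep("instance", vector_of_text),
  grep("percentage", vector_of_text)
)

# The elements that contain both words.
both <- vector_of_text[idx]

# A logical vector of the same length as the input, comparable to grepl().
both_lgl <- seq_along(vector_of_text) %in% idx
```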

Answer 3

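This is how you get TRUE only when both terms occur in an element of the vector labelTest$Text. I think this is the exact answer to the question, and it is much shorter than the other solutions.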

grepl("instance",labelTest$Text) & grepl("percentage",labelTest$Text)
