Subset rows in an R dataframe based on partial match of multiple strings

Question

I don't think this exact question has been asked. There is plenty on subsetting based on a single value (i.e. x[grepl("some string", x[["column1"]]),]), but nothing on multiple values/strings.

Here is a sample of my data:

#create sample data frame
data = data.frame(id = c(1,2,3,4), phrase = c("dog, frog, cat, moose", "horse, bunny, mouse", "armadillo, cat, bird,", "monkey, chimp, cow"))

#convert the `phrase` column to character string (the dataset I'm working on requires this)
data$phrase = as.character(data$phrase)

#list of strings to remove rows by
remove_if = c("dog", "cat")

This gives a data set that looks like this:

  id                phrase
1  1 dog, frog, cat, moose
2  2   horse, bunny, mouse
3  3 armadillo, cat, bird,
4  4    monkey, chimp, cow

I want to remove rows 1 and 3 (because row 1 contains "dog" and row 3 contains "cat"), but keep rows 2 and 4.

  id                phrase
1  2   horse, bunny, mouse
2  4    monkey, chimp, cow

In other words, I want to subset data so that it is just (the headers and) rows 2 and 4, because they contain neither "dog" nor "cat".

Thanks!

Answer 1

Use grep with invert = TRUE, after collapsing remove_if into a single pattern:

> data[grep(paste0(remove_if, collapse = "|"), data$phrase, invert = TRUE), ]
  id              phrase
2  2 horse, bunny, mouse
4  4  monkey, chimp, cow
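
To make the intermediate steps visible, here is a minimal sketch that rebuilds the question's data and prints the collapsed pattern and the kept indices. The names `pattern` and `keep` are added here for illustration only and are not part of the answer.

```r
# Sample data from the question, with phrase kept as character
data <- data.frame(
  id = c(1, 2, 3, 4),
  phrase = c("dog, frog, cat, moose", "horse, bunny, mouse",
             "armadillo, cat, bird,", "monkey, chimp, cow"),
  stringsAsFactors = FALSE
)
remove_if <- c("dog", "cat")

# Collapse the vector into one alternation pattern
pattern <- paste0(remove_if, collapse = "|")
print(pattern)
# [1] "dog|cat"

# invert = TRUE returns the indices of the elements that do NOT match
keep <- grep(pattern, data$phrase, invert = TRUE)
print(keep)
# [1] 2 4

# Index the data frame with the kept row numbers
result <- data[keep, ]
print(result)
```

Note that the elements of remove_if are interpreted as regular expressions here, so a string containing characters such as "." or "(" would need escaping first.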

Answer 2

We can use grepl with subset, after pasting the elements of 'remove_if' into a single string:

subset(data, !grepl(paste(remove_if, collapse="|"), phrase))
#    id              phrase
#2  2 horse, bunny, mouse
#4  4  monkey, chimp, cow

Answer 3

Another way (probably not the best way):

data[-unique(unlist(sapply(c(remove_if),function(x){grep(x,data$phrase)}))),]
  id              phrase
2  2 horse, bunny, mouse
4  4  monkey, chimp, cow
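
One caveat with negative indexing: if none of the strings match, the index is empty and `data[-integer(0), ]` returns zero rows rather than all of them. The sketch below shows this with a hypothetical vector `remove_if_none` that matches nothing, and a logical-mask variant (my own substitution, not the answer's code) that avoids the problem.

```r
data <- data.frame(
  id = c(1, 2, 3, 4),
  phrase = c("dog, frog, cat, moose", "horse, bunny, mouse",
             "armadillo, cat, bird,", "monkey, chimp, cow"),
  stringsAsFactors = FALSE
)
remove_if <- c("dog", "cat")
remove_if_none <- c("zebra")  # hypothetical: matches no row

# Same construction as the answer, but with no matching rows
hits <- unique(unlist(sapply(remove_if_none, function(x) grep(x, data$phrase))))
dropped_everything <- data[-hits, ]
print(nrow(dropped_everything))
# [1] 0

# Logical-mask variant: one grepl() per string, combined with OR
match_any <- function(strings, text) {
  Reduce(`|`, lapply(strings, function(x) grepl(x, text)))
}

safe_none <- data[!match_any(remove_if_none, data$phrase), ]  # keeps all 4 rows
safe <- data[!match_any(remove_if, data$phrase), ]            # keeps rows 2 and 4
print(safe)
```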

Answer 4

data[!grepl(paste0("(^|, )(", paste0(remove_if, collapse = "|"), ")(,|$)"), data$phrase),]

# id                    phrase
#  2 caterpillar, bunny, mouse
#  4        monkey, chimp, cow

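The regular expression constructed in this example is "(^|, )(dog|cat)(,|$)". It avoids matching words that contain 'cat' or 'dog' but are not that exact word, such as 'caterpillar'. The output above appears to come from a variant of the sample data in which row 2 reads "caterpillar, bunny, mouse".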
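
A self-contained sketch of the difference, assuming row 2 of the sample data is changed to start with "caterpillar" (`data2` and the other variable names are made up for illustration). It also shows `\\b` word boundaries as an alternative that does not depend on the ", " separator; that variant is not part of the answer above.

```r
data2 <- data.frame(
  id = c(1, 2, 3, 4),
  phrase = c("dog, frog, cat, moose", "caterpillar, bunny, mouse",
             "armadillo, cat, bird,", "monkey, chimp, cow"),
  stringsAsFactors = FALSE
)
remove_if <- c("dog", "cat")
alternation <- paste0(remove_if, collapse = "|")  # "dog|cat"

# Plain alternation also drops row 2, because "caterpillar" contains "cat"
plain <- data2[!grepl(alternation, data2$phrase), ]

# Delimiter-anchored pattern from the answer: "(^|, )(dog|cat)(,|$)"
anchored_pattern <- paste0("(^|, )(", alternation, ")(,|$)")
anchored <- data2[!grepl(anchored_pattern, data2$phrase), ]

# Alternative: word boundaries around the alternation
bounded_pattern <- paste0("\\b(", alternation, ")\\b")
bounded <- data2[!grepl(bounded_pattern, data2$phrase, perl = TRUE), ]

print(plain$id)
# [1] 4
print(anchored$id)
# [1] 2 4
print(bounded$id)
# [1] 2 4
```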

Answer 5

If you want to mix it up with dplyr and stringr:

library(stringr)
library(dplyr)

data %>%
  filter(str_detect(phrase, paste(remove_if, collapse = "|"), negate = TRUE))
#   id              phrase
# 1  2 horse, bunny, mouse
# 2  4  monkey, chimp, cow
