Subset rows in an R data frame based on partial match of multiple strings
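I don't think this exact question has been asked. There is plenty about subsetting on a single value (e.g. x[grepl("some string", x[["column1"]]), ]), but not on multiple values/strings.
Here is a sample of my data: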
#create sample data frame
data = data.frame(id = c(1,2,3,4), phrase = c("dog, frog, cat, moose", "horse, bunny, mouse", "armadillo, cat, bird,", "monkey, chimp, cow"))
#convert the `phrase` column to character string (the dataset I'm working on requires this)
data$phrase = as.character(data$phrase)
#list of strings to remove rows by
remove_if = c("dog", "cat")
This gives a dataset that looks like this:
id phrase
1 1 dog, frog, cat, moose
2 2 horse, bunny, mouse
3 3 armadillo, cat, bird,
4 4 monkey, chimp, cow
I want to remove rows 1 and 3 (because row 1 contains "dog" and row 3 contains "cat"), but keep rows 2 and 4:
id phrase
1 2 horse, bunny, mouse
2 4 monkey, chimp, cow
In other words, I want to subset data so that it is just (the headers and) rows 2 and 4, because they contain neither "dog" nor "cat".
Thanks!
Using grep:
> data[grep(paste0(remove_if, collapse = "|"), data$phrase, invert = TRUE), ]
id phrase
2 2 horse, bunny, mouse
4 4 monkey, chimp, cow
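To make the one-liner above easier to follow, here is a minimal sketch that splits it into steps, using the sample data from the question. The intermediate names `pattern`, `keep_rows`, and `kept` are introduced here for illustration only.

```r
# Sample data from the question
data <- data.frame(
  id = c(1, 2, 3, 4),
  phrase = c("dog, frog, cat, moose", "horse, bunny, mouse",
             "armadillo, cat, bird,", "monkey, chimp, cow"),
  stringsAsFactors = FALSE
)
remove_if <- c("dog", "cat")

# Collapse the vector into a single alternation pattern
pattern <- paste0(remove_if, collapse = "|")
print(pattern)
# [1] "dog|cat"

# invert = TRUE returns the indices of the elements that do NOT match
keep_rows <- grep(pattern, data$phrase, invert = TRUE)
print(keep_rows)
# [1] 2 4

# Index the data frame with the surviving row numbers
kept <- data[keep_rows, ]
print(kept)
```

Because `grep` returns positions rather than a logical vector, `invert = TRUE` is what turns "rows that match" into "rows to keep".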
We can use grepl and subset after pasting 'remove_if' into a single string:
subset(data, !grepl(paste(remove_if, collapse="|"), phrase))
# id phrase
#2 2 horse, bunny, mouse
#4 4 monkey, chimp, cow
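One caveat with pasting the terms into a single pattern is that each term is then interpreted as a regular expression. The sketch below assumes hypothetical phrases and terms (not from the question) that contain regex metacharacters, and compares the pasted pattern with literal matching via `fixed = TRUE`, applied one term at a time.

```r
# Hypothetical phrases and terms, chosen to contain regex metacharacters
phrases <- c("a.b, frog", "axb, bunny", "monkey (big), cow")
terms <- c("a.b", "(big)")

# As a regex, "." matches any character and "(big)" is a group matching "big"
regex_hit <- grepl(paste(terms, collapse = "|"), phrases)

# Literal matching: test each term with fixed = TRUE, then OR the results
literal_hit <- Reduce(`|`, lapply(terms, function(t) grepl(t, phrases, fixed = TRUE)))

print(phrases[!regex_hit])
# character(0)
print(phrases[!literal_hit])
# [1] "axb, bunny"
```

For plain words such as "dog" and "cat" the two approaches give the same result, so this only matters when the terms may contain characters like `.`, `(`, or `+`.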
Another way (probably not the best way):
data[-unique(unlist(sapply(c(remove_if),function(x){grep(x,data$phrase)}))),]
id phrase
2 2 horse, bunny, mouse
4 4 monkey, chimp, cow
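One thing to watch with negative indexing is the case where no term matches any row: the index vector is then empty, and `data[-integer(0), ]` selects zero rows instead of all of them. A minimal sketch of the problem and one possible workaround with `setdiff` follows; the terms in `no_hits` are hypothetical and chosen so that nothing matches.

```r
# Sample data from the question
data <- data.frame(
  id = c(1, 2, 3, 4),
  phrase = c("dog, frog, cat, moose", "horse, bunny, mouse",
             "armadillo, cat, bird,", "monkey, chimp, cow"),
  stringsAsFactors = FALSE
)
no_hits <- c("zebra", "yak")  # hypothetical terms that match no row

# Collect the matching row indices for every term (empty here)
idx <- unique(unlist(lapply(no_hits, function(x) grep(x, data$phrase))))

# -integer(0) is still integer(0), so this selects zero rows
dropped_all <- data[-idx, ]
print(nrow(dropped_all))
# [1] 0

# Computing the rows to keep explicitly avoids the empty-index problem
keep <- setdiff(seq_len(nrow(data)), idx)
kept_all <- data[keep, ]
print(nrow(kept_all))
# [1] 4
```

The logical `!grepl(...)` forms in the other answers do not have this problem, since an all-FALSE match vector negates to all TRUE.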
data[!grepl(paste0("(^|, )(", paste0(remove_if, collapse = "|"), ")(,|$)"), data$phrase),]
# id phrase
# 2 caterpillar, bunny, mouse
# 4 monkey, chimp, cow
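The regular expression constructed in this example is "(^|, )(dog|cat)(,|$)", which avoids matching words that contain 'cat' or 'dog' but are not that exact word, such as 'caterpillar'. (The output above appears to come from data in which row 2 reads "caterpillar, bunny, mouse" rather than "horse, bunny, mouse".)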
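As a possible alternative to anchoring on the ", " delimiter, word boundaries (`\\b`) can express the same "whole word only" idea. The sketch below assumes a variant of the sample data in which row 2 contains "caterpillar"; the names `loose` and `strict` are introduced here for illustration.

```r
# Variant of the question's phrases, with "caterpillar" in row 2 (assumed)
phrases <- c("dog, frog, cat, moose", "caterpillar, bunny, mouse",
             "armadillo, cat, bird,", "monkey, chimp, cow")
remove_if <- c("dog", "cat")

# Plain alternation: also matches "cat" inside "caterpillar"
loose <- paste(remove_if, collapse = "|")

# Same alternation wrapped in word boundaries
strict <- paste0("\\b(", loose, ")\\b")

loose_hit <- grepl(loose, phrases)
print(loose_hit)
# [1]  TRUE  TRUE  TRUE FALSE

strict_hit <- grepl(strict, phrases, perl = TRUE)
print(strict_hit)
# [1]  TRUE FALSE  TRUE FALSE
```

Note that `\\b` treats any non-word character as a boundary, so a term like "cat" would still match inside "cat-like"; the delimiter-based pattern is stricter when the list is known to be comma-separated.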
If you want to combine this with dplyr and stringr:
library(stringr)
library(dplyr)
data %>%
filter(str_detect(phrase, paste(remove_if, collapse = "|"), negate = TRUE))
# id phrase
# 1 2 horse, bunny, mouse
# 2 4 monkey, chimp, cow