Detect multiple patterns in R using logical operators?
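I am trying to detect whether a combination of patterns is present or absent in a variable of a data frame.
There are some similar questions, but I could not find one that answers exactly what I am trying to achieve.
I am looking for a way to:
- detect whether a pattern is present
- define multiple patterns using logical operators (and, or, not = &, |, !)
- ignore case
- return the output as another column of TRUE/FALSE
I have not found a solution yet, but here is what I have done so far, to get your guidance:
Create an example data frame: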
x=structure(list(Sources = structure(c(1L, 7L, 6L, 8L, 9L, 4L,
3L, 5L, 2L), .Label =
c("Found in all nutritious foods in moderate amounts: pork, whole grain foods or enriched breads and cereals, legumes, nuts and seeds",
"Found only in fruits and vegetables, especially citrus fruits, vegetables in the cabbage family, cantaloupe, strawberries, peppers, tomatoes, potatoes, lettuce, papayas, mangoes, kiwifruit",
"Leafy green vegetables and legumes, seeds, orange juice, and liver; now added to most refined grains",
"Meat, fish, poultry, vegetables, fruits",
"Meat, poultry, fish, seafood, eggs, milk and milk products; not found in plant foods",
"Meat, poultry, fish, whole grain foods, enriched breads and cereals, vegetables (especially mushrooms, asparagus, and leafy green vegetables), peanut butter",
"Milk and milk products; leafy green vegetables; whole grain foods, enriched breads and cereals",
"Widespread in foods", "Widespread in foods; also produced in intestinal tract by bacteria"
), class = "factor")), class = "data.frame", row.names = c(NA,
-9L))
This code detects whether either of the 2 specified strings is present; (?i) means ignore case.
x$present = str_detect(x$Sources, "(?i)Vegetables|(?i)Meat")
# but it does not work with "and"
x$present =str_detect(x$Sources, "(?i)Vegetables&(?i)Meat")
#here it gives FALSE for all, my expected output is to return TRUE for those that contain both words
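This one works by filtering for the desired combination:
- it works with |, & and !
- but it only filters the rows of interest; is it possible to add another column to the dataset that is TRUE when the pattern is present?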
x %>% filter (str_detect(x$Sources, "(?i)Vegetables") & str_detect(x$Sources, "(?i)Meat"))
x %>% filter (str_detect(x$Sources, "(?i)Vegetables") & !str_detect(x$Sources, "(?i)Meat")) #does not contain meat
x %>% filter (!str_detect(x$Sources, "(?i)Meat") & str_detect(x$Sources, "(?i)Vegetables") & str_detect(x$Sources, "(?i)Grain"))
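Finally, I found this package, which looks like it can do the job, but it only works on vectors. Is there a way to make it work on a variable in a data frame, for example by using lapply, or to return another TRUE/FALSE variable?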
library(sjmisc)
str_contains(x$Sources, "Meat", ignore.case = T)
Use mutate and str_detect to create a new column:
library(tidyverse)
x %>%
  mutate(pattern_detected =
           str_detect(Sources, "(?i)Vegetables") &
           str_detect(Sources, "(?i)Meat"))
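The same idea can be sketched in base R without loading any packages. Each call to grepl returns one logical vector per pattern, and the vectors are then combined with R's own &, | and ! operators. This is only a minimal illustration: the data frame `foods` and the column names `both`, `either` and `veg_no_meat` are made up for the example and are not part of the question's data.

```r
# Hypothetical toy data; `foods` and the new column names are illustrative only.
foods <- data.frame(
  Sources = c("Meat, fish, poultry, vegetables, fruits",
              "Widespread in foods",
              "Leafy green vegetables and legumes",
              "Meat, poultry, fish, seafood, eggs"),
  stringsAsFactors = FALSE
)

# One logical vector per pattern; ignore.case = TRUE plays the role of (?i).
has_veg  <- grepl("vegetables", foods$Sources, ignore.case = TRUE)
has_meat <- grepl("meat", foods$Sources, ignore.case = TRUE)

# Combine the vectors with R's logical operators and store them as columns.
foods$both        <- has_veg & has_meat    # and
foods$either      <- has_veg | has_meat    # or
foods$veg_no_meat <- has_veg & !has_meat   # and + not
```

The logical operators are applied to the results of the matches, not written inside the regular expression, which is why a pattern such as "Vegetables&Meat" only matches the literal text "Vegetables&Meat".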
To use the function from the sjmisc package on a data.frame, the workhorse is sapply, applied twice: once over the columns of the data.frame and once over the rows.
library(sjmisc)
# build dummy data.frame
df <- data.frame(x, x, x)
sapply(df, function(x) sapply(x,
str_contains,
pattern = c("Meat", "Vegetables"),
logic = "and", ignore.case = TRUE))
Sources Sources.1 Sources.2
[1,] FALSE FALSE FALSE
[2,] FALSE FALSE FALSE
[3,] TRUE TRUE TRUE
[4,] FALSE FALSE FALSE
[5,] FALSE FALSE FALSE
[6,] TRUE TRUE TRUE
[7,] FALSE FALSE FALSE
[8,] FALSE FALSE FALSE
[9,] FALSE FALSE FALSE
The output is a matrix. If you want a data.frame, wrap it in as.data.frame:
as.data.frame(sapply(df, function(x) sapply(x,
str_contains,
pattern = c("Meat", "Vegetables"),
logic = "and", ignore.case = TRUE)))
Sources Sources.1 Sources.2
1 FALSE FALSE FALSE
2 FALSE FALSE FALSE
3 TRUE TRUE TRUE
4 FALSE FALSE FALSE
5 FALSE FALSE FALSE
6 TRUE TRUE TRUE
7 FALSE FALSE FALSE
8 FALSE FALSE FALSE
9 FALSE FALSE FALSE
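If the number of patterns varies, the "and" logic can also be written as a small base R helper, as a possible alternative to the nested sapply. The sketch below is an assumption-laden illustration: `detect_all` and `sources` are hypothetical names, not functions or objects from sjmisc or stringr. It also shows a single-regex variant that uses lookaheads, which needs perl = TRUE in grepl.

```r
# Hypothetical helper (not part of any package): TRUE where every pattern matches.
detect_all <- function(x, patterns, ignore.case = TRUE) {
  # One logical vector per pattern, then reduce them with element-wise "and".
  hits <- lapply(patterns, grepl, x = as.character(x), ignore.case = ignore.case)
  Reduce(`&`, hits)
}

# Illustrative input, loosely modelled on the Sources column.
sources <- c("Meat, fish, poultry, vegetables, fruits",
             "Widespread in foods",
             "Milk; leafy green vegetables; whole grain foods")

all_hits <- detect_all(sources, c("meat", "vegetables"))

# Same "and" logic in one regular expression, using lookaheads.
lookahead_hits <- grepl("(?=.*meat)(?=.*vegetables)", sources,
                        perl = TRUE, ignore.case = TRUE)
```

Because the helper returns a plain logical vector of the same length as its input, the result can be assigned directly as a new column, for example x$present <- detect_all(x$Sources, c("meat", "vegetables")).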