Filtering a Data Frame Based on Row Conditions
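I came up with the following example to illustrate my question.
Suppose there are 5 balls:
- Red
- Blue
- Green
- Yellow
- Orange
There are 5! = 120 ways to arrange these balls (n!). I can enumerate all of these arrangements below: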
library(combinat)
library(dplyr)
my_list = c("Red", "Blue", "Green", "Yellow", "Orange")
d = permn(my_list)
all_combinations = as.data.frame(matrix(unlist(d), ncol = 120)) %>%
setNames(paste0("col", 1:120))
all_combinations[,1:5]
col1 col2 col3 col4 col5
1 Red Red Red Red Orange
2 Blue Blue Blue Orange Red
3 Green Green Orange Blue Blue
4 Yellow Orange Green Green Green
5 Orange Yellow Yellow Yellow Yellow
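My question:
Suppose I want to filter this list by the following conditions:
- The "Red" ball has to be in the first or second position (from left to right)
- There must be at least 2 positions between the "Blue" ball and the "Green" ball
- The "Yellow" ball cannot be in the last position
I then tried to filter the data above according to these 3 conditions: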
# attempt to write first condition
cond_1 <- all_combinations[which(all_combinations[1,]== "Red" || all_combinations[2,] == "Red"), ]
#not sure how to write the second condition
# attempt to write the third condition
cond_3 <- data_frame_version[which(data_frame_version[5,] !== "Yellow" ), ]
# if everything worked, an "anti join" style statement could be written to remove "cond_1, cond_2, cond_3" from the original data?
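But this does not work at all: the first and third conditions return every column of the data frame, with only 4 rows.
Can someone show me how to correctly filter "all_combinations" using the 3 filters above?
Note:
The code below transposes the original data: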
library(data.table)
tpose = transpose(all_combinations)
df = tpose
#group every 5 rows by the same id to identify unique combinations
bloc_len <- 5
df$bloc <-
rep(seq(1, 1 + nrow(df) %/% bloc_len), each = bloc_len, length.out = nrow(df))
head(df)
V1 V2 V3 V4 V5 bloc
1 Red Blue Green Yellow Orange 1
2 Red Blue Green Orange Yellow 1
3 Red Blue Orange Green Yellow 1
4 Red Orange Blue Green Yellow 1
5 Orange Red Blue Green Yellow 1
6 Orange Red Blue Yellow Green 2
You can do it like this:
library(tidyverse)
tpose %>%
mutate(blue_delete = case_when(V1 == "Blue" & V2 == "Green" ~ TRUE,
V1 == "Blue" & V3 == "Green" ~ TRUE,
V2 == "Blue" & V3 == "Green" ~ TRUE,
V3 == "Blue" & V4 == "Green" ~ TRUE,
V4 == "Blue" & V5 == "Green" ~ TRUE,
TRUE ~ FALSE)) %>%
filter(V3 != "Red" & V4 != "Red" & V5 != "Red",
V5 != "Yellow",
blue_delete == FALSE) %>%
select(-blue_delete)
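The case_when() above lists specific Blue-before-Green column pairs by hand. A position-based variant that treats both orders the same way could be sketched as follows. This is only a sketch: pos_of() and min_gap are hypothetical names that do not appear in the thread, tpose is rebuilt here so the snippet runs on its own, and min_gap = 2 assumes that "at least 2 positions" means the two positions differ by 2 or more (set it to 3 for the stricter reading).

```r
library(combinat)
library(dplyr)

my_list <- c("Red", "Blue", "Green", "Yellow", "Orange")

# Stand-in for tpose: one permutation per row, columns V1..V5.
tpose <- as.data.frame(do.call(rbind, permn(my_list)))

# Hypothetical helper: column index of a given colour in every row.
pos_of <- function(df, colour) {
  unname(apply(df, 1, function(r) which(r == colour)))
}

# Assumption: the Blue and Green positions must differ by at least this much.
min_gap <- 2

blue_pos <- pos_of(tpose, "Blue")
green_pos <- pos_of(tpose, "Green")

result <- tpose %>%
  filter(V1 == "Red" | V2 == "Red",
         V5 != "Yellow",
         abs(blue_pos - green_pos) >= min_gap)

nrow(result)
```

Because the gap is computed from positions, nothing has to be enumerated by hand, and changing the reading of condition 2 only means changing min_gap.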
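If you are not too concerned about the data.frame structure, my preferred approach is to keep each outcome as an element of the list (i.e. your d variable) and use sapply() with a function that checks whether that outcome satisfies all the conditions.
Observe: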
library(combinat)
my_list <- c("Red", "Blue", "Green", "Yellow", "Orange")
my_list_perm <- combinat::permn(my_list)
# This function examines one particular outcome of the trial, e.g. outcome = ["Blue", "Orange", "Red", "Green", "Yellow"]
test_conditions <- function(outcome) {
# Condition 1
condition_1 <- "Red" %in% outcome[c(1,2)]
# Condition 2
condition_2 <- base::abs(base::which(outcome == "Blue") - base::which(outcome == "Green")) >= 2
# Condition 3
condition_3 <- base::which(outcome == "Yellow") != base::length(outcome)
all <- condition_1 && condition_2 && condition_3
return(all)
}
my_list_matches <- base::which(base::sapply(my_list_perm, test_conditions)) # applies the function to each list element (which itself is an outcome)
print(my_list_matches) # displays which trials / outcomes satisfied all conditions
#> [1] 6 7 8 9 10 12 19 22 29 31 32 33 34 35 41 48 49 50 111 112 113 120
Created on 2022-01-04 by the reprex package (v1.0.0)
You can then use the matching indices to subset the original list.
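As a rough illustration of that last step, the matching indices can be used to index the list directly. The names matching_perms and matching_df are made up for this sketch, and the condition function is restated in compact form only so that the snippet runs on its own.

```r
library(combinat)

my_list <- c("Red", "Blue", "Green", "Yellow", "Orange")
my_list_perm <- combinat::permn(my_list)

# Same three checks as test_conditions() above, restated compactly.
test_conditions <- function(outcome) {
  blue <- which(outcome == "Blue")
  green <- which(outcome == "Green")
  "Red" %in% outcome[1:2] &&
    abs(blue - green) >= 2 &&
    which(outcome == "Yellow") != length(outcome)
}

my_list_matches <- which(sapply(my_list_perm, test_conditions))

# Index the list with the matching positions to keep only those outcomes.
matching_perms <- my_list_perm[my_list_matches]

# Optionally stack them into a data frame, one permutation per row.
matching_df <- as.data.frame(do.call(rbind, matching_perms))
names(matching_df) <- paste0("col", 1:5)

length(matching_perms)
```

The list form is convenient for further sapply() checks, while the data frame form is easier to print or join with other data.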
Here is a scalable tidyverse solution.
First, let's turn the data into a tibble with 120 rows, each row corresponding to one arrangement of the balls.
library(tidyverse)
library(combinat)
data = my_list %>%
permn() %>%
map(~ set_names(.x, paste0("ball", 1:5))) %>%
do.call(bind_rows, args = .) %>%
mutate(id = row_number())
Our data:
# A tibble: 120 x 6
ball1 ball2 ball3 ball4 ball5 id
<chr> <chr> <chr> <chr> <chr> <int>
1 Red Blue Green Yellow Orange 1
2 Red Blue Green Orange Yellow 2
3 Red Blue Orange Green Yellow 3
4 Red Orange Blue Green Yellow 4
5 Orange Red Blue Green Yellow 5
6 Orange Red Blue Yellow Green 6
7 Red Orange Blue Yellow Green 7
8 Red Blue Orange Yellow Green 8
9 Red Blue Yellow Orange Green 9
10 Red Blue Yellow Green Orange 10
# ... with 110 more rows
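The key idea of this solution is to convert the data to long format. That makes checking each condition trivial. Afterwards we can pivot it back to wide format.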
data %>%
pivot_longer(-id) %>%
mutate(ball_number = as.numeric(str_extract(name, "[1-5]"))) %>%
group_by(id) %>%
filter(
# Condition 1
ball_number[value == "Red"] %in% c(1, 2),
# Condition 2
abs(ball_number[value == "Blue"] - ball_number[value == "Green"]) >= 3,
# Condition 3
ball_number[value == "Yellow"] != 5
) %>%
select(-ball_number) %>%
pivot_wider(values_from = "value", names_from = "name")
The output shows that there are 10 permutations:
# A tibble: 10 x 6
# Groups: id [10]
id ball1 ball2 ball3 ball4 ball5
<int> <chr> <chr> <chr> <chr> <chr>
1 8 Red Blue Orange Yellow Green
2 9 Red Blue Yellow Orange Green
3 32 Red Green Yellow Orange Blue
4 33 Red Green Orange Yellow Blue
5 48 Green Red Orange Yellow Blue
6 49 Green Red Yellow Orange Blue
7 50 Green Red Yellow Blue Orange
8 111 Blue Red Yellow Green Orange
9 112 Blue Red Yellow Orange Green
10 113 Blue Red Orange Yellow Green
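The improvement this solution offers is that, thanks to our ball_number variable, every condition you want to check becomes very simple. With more balls, you could easily extend this solution to more complex conditions, such as Red being among the first 5 positions, or the position of the Blue ball plus the position of the Green ball equaling 7.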
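To make that concrete, here is a minimal sketch of one such extra rule on the same 5-ball data. It assumes that "Blue plus Green equals 7" refers to the sum of the two positions. The name sum_to_7 is invented for this example, and the tibble is rebuilt so the snippet is self-contained.

```r
library(tidyverse)
library(combinat)

my_list <- c("Red", "Blue", "Green", "Yellow", "Orange")

# Same construction as above: one row per permutation, plus an id.
data <- my_list %>%
  permn() %>%
  map(~ set_names(.x, paste0("ball", 1:5))) %>%
  do.call(bind_rows, args = .) %>%
  mutate(id = row_number())

# Assumed rule: position of Blue + position of Green == 7.
sum_to_7 <- data %>%
  pivot_longer(-id) %>%
  mutate(ball_number = as.numeric(str_extract(name, "[1-5]"))) %>%
  group_by(id) %>%
  filter(ball_number[value == "Blue"] + ball_number[value == "Green"] == 7) %>%
  ungroup() %>%
  select(-ball_number) %>%
  pivot_wider(values_from = "value", names_from = "name")

nrow(sum_to_7)
```

Only the expression inside filter() changes from one rule to the next, which is the point of keeping a numeric ball_number column around.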
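You can do the following. I know this is not the prettiest or most optimized solution you will find. But it works!
# byrow = TRUE puts one permutation in each row; the default column-wise
# fill would mix balls from different permutations within a row
all_combinations = as.data.frame(matrix(unlist(d), ncol = 5, byrow = TRUE)) %>%
  setNames(paste0("col", 1:5))
# Condition 1: Red is in the first or second position
cond_1 <- all_combinations %>%
  filter(col1 == "Red" | col2 == "Red")
# Condition 2: at least 2 positions between Blue and Green, in either order
cond_2 <- cond_1 %>%
  mutate(cond = (col1 == 'Blue' & col4 == 'Green') |
           (col1 == 'Blue' & col5 == 'Green') |
           (col2 == 'Blue' & col5 == 'Green') |
           (col1 == 'Green' & col4 == 'Blue') |
           (col1 == 'Green' & col5 == 'Blue') |
           (col2 == 'Green' & col5 == 'Blue')) %>%
  filter(cond)
# Condition 3: Yellow is not in the last position
cond_3 <- cond_2 %>%
  filter(col5 != "Yellow")
Output:
col1 col2 col3 col4 col5 cond
1 Red Blue Orange Yellow Green TRUE
2 Red Blue Yellow Orange Green TRUE
3 Red Green Yellow Orange Blue TRUE
4 Red Green Orange Yellow Blue TRUE
5 Green Red Orange Yellow Blue TRUE
6 Green Red Yellow Orange Blue TRUE
7 Green Red Yellow Blue Orange TRUE
8 Blue Red Yellow Green Orange TRUE
9 Blue Red Yellow Orange Green TRUE
10 Blue Red Orange Yellow Green TRUE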
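Maybe I have misunderstood the question, but as far as I can see, none of the answers seems to show a solution in which there are 2 columns between the colors, as step 2 of the question requires.
I took the liberty of testing the data and found that only with "Yellow" and "Orange" can you find a filter that meets your requirements (as far as I can tell).
This is not a general answer, and it is not actually correct either, because "Yellow" ends up in the last position, which violates the rule. However:
With the last position already accounted for, requiring a distance of 2 between the colors reduces the problem to a 4-column problem. So a distance of 2 can only be achieved between column 1 and column 4.
This leads to 4 assumptions:
- Column 1 needs to be "Green" or "Blue"
- Column 2 needs to be "Red"
- Column 3 should not be "Green" or "Blue"
- Column 4 should again be "Green" or "Blue", but not the one in column 1
Here is the code I came up with. It is not pretty and, as explained, "Green" and "Blue" are swapped for "Yellow" and "Orange", but I think it works.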
library(combinat)
library(tidyverse)
my_list = c("Red", "Blue", "Green", "Yellow", "Orange")
d = permn(my_list)
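# Caution: matrix() fills column-wise by default, so without byrow = TRUE
# the rows below are not the permutations returned by permn()
all_combinations = as.data.frame(matrix(unlist(d), ncol = 5)) %>%
setNames(paste0("col", 1:5))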
`%!in%` <- Negate(`%in%`)
combis <- all_combinations %>%
filter(col1 %in% c("Yellow", "Orange"),
col2 == "Red",
!col3 %in% c("Yellow", "Orange"),
col5 == "Yellow")
results <- vector()
for(i in seq_along(combis[,1])){
if(combis[i,][1] %!in% c(combis[i,][4], "Red", "Green", "Blue")){
results <- combis[i,]
}
}
results
col1 col2 col3 col4 col5
3 Yellow Red Green Orange Yellow
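I think that if you consider the position of each ball rather than the ball at each position, you may find it easier to apply the three conditions, as in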
library(combinat)
my_list = c("Red", "Blue", "Green", "Yellow", "Orange")
d = permn(1:5)
md = matrix(unlist(d), ncol=5, byrow=TRUE)
colnames(md) = my_list
ok = md[, "Red"] <= 2 &
abs(md[, "Blue"] - md[, "Green"]) > 2 &
md[, "Yellow"] != 5
sum(ok)
# 10
md[ok, ]
# Red Blue Green Yellow Orange
# [1,] 1 2 5 3 4
# [2,] 1 5 2 3 4
# [3,] 1 5 2 4 3
# [4,] 1 2 5 4 3
# [5,] 2 4 1 3 5
# [6,] 2 1 4 3 5
# [7,] 2 1 5 4 3
# [8,] 2 5 1 4 3
# [9,] 2 5 1 3 4
#[10,] 2 1 5 3 4
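If the result is wanted back as colors per position, each row of the position matrix can be inverted with order(). This is a sketch under that assumption: arrangements and the pos1..pos5 column names are invented here, and the matrix is rebuilt so the snippet runs on its own.

```r
library(combinat)

my_list <- c("Red", "Blue", "Green", "Yellow", "Orange")

# Each row holds the position of every ball, as in the answer above.
md <- matrix(unlist(permn(1:5)), ncol = 5, byrow = TRUE)
colnames(md) <- my_list

ok <- md[, "Red"] <= 2 &
  abs(md[, "Blue"] - md[, "Green"]) > 2 &
  md[, "Yellow"] != 5

# order(p) lists the ball indices from position 1 to position 5,
# so my_list[order(p)] is the arrangement read from left to right.
arrangements <- t(apply(md[ok, , drop = FALSE], 1, function(p) my_list[order(p)]))
colnames(arrangements) <- paste0("pos", 1:5)

arrangements
```

The conditions stay as simple numeric comparisons on md, and the conversion to color names happens only once, at the end.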