Filtering a Data Frame based on Row Conditions

Question

I came up with the following example to illustrate my question.

Suppose there are 5 balls:

Red
Blue
Green
Yellow
Orange

Normally, there are 5! = 120 ways to arrange these balls (n!). I can enumerate all of these permutations below:

library(combinat)
library(dplyr)

my_list = c("Red", "Blue", "Green", "Yellow", "Orange")

d = permn(my_list)

all_combinations  = as.data.frame(matrix(unlist(d), ncol = 120)) %>%
  setNames(paste0("col", 1:120))


all_combinations[,1:5]

    col1   col2   col3   col4   col5
1    Red    Red    Red    Red Orange
2   Blue   Blue   Blue Orange    Red
3  Green  Green Orange   Blue   Blue
4 Yellow Orange  Green  Green  Green
5 Orange Yellow Yellow Yellow Yellow

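My question:

Suppose I want to filter this list by the following conditions:

The "Red" ball must be in the first or second position (from left to right)
There must be at least 2 positions between the "Blue" ball and the "Green" ball
The "Yellow" ball cannot be in the last position

I then tried to filter the data above according to these 3 conditions: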

# attempt to write first condition
    cond_1 <- all_combinations[which(all_combinations[1,]== "Red" || all_combinations[2,] == "Red"), ]

#not sure how to write the second condition
    
 # attempt to write the third condition   
    cond_3 <- data_frame_version[which(data_frame_version[5,] !== "Yellow" ), ]

# if everything worked, an "anti join" style statement could be written to remove "cond_1, cond_2, cond_3" from the original data?

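But this does not work at all: the first and third conditions return all columns of the data frame with only 4 rows.

Can someone please show me how to correctly filter "all_combinations" using the 3 filters above?

Note:

The following code transposes the original data: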

 library(data.table)

    tpose = transpose(all_combinations)

    df = tpose
    
#group every 5 rows by the same id to identify unique combinations

    bloc_len <- 5
    
    df$bloc <- 
        rep(seq(1, 1 + nrow(df) %/% bloc_len), each = bloc_len, length.out = nrow(df))
    
   
 head(df)

      V1     V2     V3     V4     V5 bloc
1    Red   Blue  Green Yellow Orange    1
2    Red   Blue  Green Orange Yellow    1
3    Red   Blue Orange  Green Yellow    1
4    Red Orange   Blue  Green Yellow    1
5 Orange    Red   Blue  Green Yellow    1
6 Orange    Red   Blue Yellow  Green    2

Answer 1

You can do it like this:

library(tidyverse)
tpose %>%
  mutate(blue_delete = case_when(V1 == "Blue" & V2 == "Green" ~ TRUE,
                                 V1 == "Blue" & V3 == "Green" ~ TRUE,
                                 V2 == "Blue" & V3 == "Green" ~ TRUE,
                                 V3 == "Blue" & V4 == "Green" ~ TRUE,
                                 V4 == "Blue" & V5 == "Green" ~ TRUE,
                                 TRUE ~ FALSE)) %>%
  filter(V3 != "Red" & V4 != "Red" & V5 != "Red",
         V5 != "Yellow",
         blue_delete == FALSE) %>%
  select(-blue_delete)
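
The case_when above lists individual Blue/Green column pairs by hand. As a hedged alternative, here is a minimal sketch that derives each ball's position per row, so every column pair and both orders are covered by one comparison. It assumes that "at least 2 positions between" means the positions differ by at least 3 (the other reading is a difference of at least 2), it rebuilds tpose with a matrix call instead of data.table, and pos_of and min_diff are illustrative names, not part of the answer.

```r
library(combinat)

my_list <- c("Red", "Blue", "Green", "Yellow", "Orange")

# One permutation per row, columns V1..V5 (same layout as tpose in the question)
tpose <- as.data.frame(matrix(unlist(permn(my_list)), ncol = 5, byrow = TRUE))

# pos_of() is a hypothetical helper: the column index (1-5) holding `colour` in each row
pos_of <- function(df, colour) unname(apply(df == colour, 1, which))

# Assumption: "at least 2 positions between" means the positions differ by >= 3.
# Set min_diff <- 2 for the other reading.
min_diff <- 3

keep <- pos_of(tpose, "Red") <= 2 &
  abs(pos_of(tpose, "Blue") - pos_of(tpose, "Green")) >= min_diff &
  pos_of(tpose, "Yellow") != 5

filtered <- tpose[keep, ]
nrow(filtered)
```

Because the conditions are written against positions rather than specific columns, changing the required gap is a one-line edit.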

Answer 2

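If you don't care much about the data.frame structure, my preferred approach is to keep each outcome as an element of a list (i.e. your d variable) and sapply() a function that checks whether that outcome satisfies all the conditions.

Observe: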

library(combinat)

my_list <- c("Red", "Blue", "Green", "Yellow", "Orange")
my_list_perm <- combinat::permn(my_list) 

# This function examines one particular outcome of the trial, e.g. outcome = ["Blue", "Orange", "Red", "Green", "Yellow"]
test_conditions <- function(outcome) {
  
  # Condition 1
  condition_1 <- "Red" %in% outcome[c(1,2)]
  
  # Condition 2
  condition_2 <- base::abs(base::which(outcome == "Blue") - base::which(outcome == "Green")) >= 2
  
  # Condition 3
  condition_3 <- base::which(outcome == "Yellow") != base::length(outcome)
  
  all <- condition_1 && condition_2 && condition_3
  
  return(all)
}

my_list_matches <- base::which(base::sapply(my_list_perm, test_conditions)) # applies the function to each list element (which itself is an outcome)

print(my_list_matches) # displays which trials / outcomes satisfied all conditions

#>  [1]   6   7   8   9  10  12  19  22  29  31  32  33  34  35  41  48  49  50 111 112 113 120

Created on 2022-01-04 by the reprex package (v1.0.0)

You can then use the matching indices to subset the original list.
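
A minimal sketch of that last step, repeating the setup so it runs on its own. The names matching_perms and matching_df, and the pos1..pos5 column names, are illustrative and not part of the answer.

```r
library(combinat)

my_list <- c("Red", "Blue", "Green", "Yellow", "Orange")
my_list_perm <- permn(my_list)

# Same three checks as test_conditions above, written compactly
test_conditions <- function(outcome) {
  "Red" %in% outcome[1:2] &&
    abs(which(outcome == "Blue") - which(outcome == "Green")) >= 2 &&
    which(outcome == "Yellow") != length(outcome)
}

my_list_matches <- which(sapply(my_list_perm, test_conditions))

# Subset the list of permutations by the matching indices
matching_perms <- my_list_perm[my_list_matches]

# Optional: stack the matches into a data frame, one permutation per row
matching_df <- as.data.frame(do.call(rbind, matching_perms))
names(matching_df) <- paste0("pos", 1:5)

length(matching_perms)  # 22, one per index printed above
```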

Answer 3

Here is a scalable tidyverse solution.

First, let's turn the data into a 120-row tibble, with each row corresponding to one permutation of the balls.

library(tidyverse)
library(combinat)
data = my_list %>% 
  permn() %>%
  map(~ set_names(.x, paste0("ball", 1:5))) %>%
  do.call(bind_rows, args = .) %>%
  mutate(id = row_number())

Our data:

# A tibble: 120 x 6
   ball1  ball2  ball3  ball4  ball5     id
   <chr>  <chr>  <chr>  <chr>  <chr>  <int>
 1 Red    Blue   Green  Yellow Orange     1
 2 Red    Blue   Green  Orange Yellow     2
 3 Red    Blue   Orange Green  Yellow     3
 4 Red    Orange Blue   Green  Yellow     4
 5 Orange Red    Blue   Green  Yellow     5
 6 Orange Red    Blue   Yellow Green      6
 7 Red    Orange Blue   Yellow Green      7
 8 Red    Blue   Orange Yellow Green      8
 9 Red    Blue   Yellow Orange Green      9
10 Red    Blue   Yellow Green  Orange    10
# ... with 110 more rows

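The key idea of this solution is to convert the data to long format, which makes checking each condition trivial. Afterwards we can convert it back to wide format.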

data %>%
  pivot_longer(-id) %>%
  mutate(ball_number = as.numeric(str_extract(name, "[1-5]"))) %>%
  group_by(id) %>%
  filter(
    # Condition 1
    ball_number[value == "Red"] %in% c(1, 2),
    # Condition 2
    abs(ball_number[value == "Blue"] - ball_number[value == "Green"]) >= 3,
    # Condition 3
    ball_number[value == "Yellow"] != 5
  ) %>%
  select(-ball_number) %>% 
  pivot_wider(values_from = "value", names_from = "name")

The output shows that there are 10 permutations:

# A tibble: 10 x 6
# Groups:   id [10]
      id ball1 ball2 ball3  ball4  ball5 
   <int> <chr> <chr> <chr>  <chr>  <chr> 
 1     8 Red   Blue  Orange Yellow Green 
 2     9 Red   Blue  Yellow Orange Green 
 3    32 Red   Green Yellow Orange Blue  
 4    33 Red   Green Orange Yellow Blue  
 5    48 Green Red   Orange Yellow Blue  
 6    49 Green Red   Yellow Orange Blue  
 7    50 Green Red   Yellow Blue   Orange
 8   111 Blue  Red   Yellow Green  Orange
 9   112 Blue  Red   Yellow Orange Green 
10   113 Blue  Red   Orange Yellow Green

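The improvement this solution offers is that, thanks to our ball_number variable, every condition you want to check is very simple. With more balls, you could easily extend it to more complex conditions, such as Red being among the first 5 balls, or the Blue ball number plus the Green ball number equaling 7.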
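
As a rough illustration of such an extension, the sketch below applies the "Blue plus Green equals 7" condition to the same five-ball data. It reuses the pipeline from this answer; the name extended is illustrative, and the other three conditions are left out on purpose so that only the new one is shown.

```r
library(tidyverse)
library(combinat)

my_list <- c("Red", "Blue", "Green", "Yellow", "Orange")

# Same 120-row tibble as above: one permutation per row plus an id
data <- my_list %>%
  permn() %>%
  map(~ set_names(.x, paste0("ball", 1:5))) %>%
  do.call(bind_rows, args = .) %>%
  mutate(id = row_number())

extended <- data %>%
  pivot_longer(-id) %>%
  mutate(ball_number = as.numeric(str_extract(name, "[1-5]"))) %>%
  group_by(id) %>%
  # Illustrative condition: Blue's position plus Green's position equals 7
  filter(ball_number[value == "Blue"] + ball_number[value == "Green"] == 7) %>%
  ungroup() %>%
  select(-ball_number) %>%
  pivot_wider(values_from = "value", names_from = "name")

nrow(extended)  # 24: Blue/Green in positions (2,5), (5,2), (3,4) or (4,3), times 3! for the rest
```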

Answer 4

You can do the following. I know it isn't the prettiest or most optimized solution you could find. But it works!

# byrow = TRUE puts one permutation in each row (col1..col5 are the positions)
all_combinations  = as.data.frame(matrix(unlist(d), ncol = 5, byrow = TRUE)) %>%
  setNames(paste0("col", 1:5))

# Condition 1: Red in the first or second position
cond_1 <- all_combinations %>%
  filter(col1 == "Red" | col2 == "Red")

# Condition 2: Blue and Green at least 3 columns apart, in either order
cond_2 <- cond_1 %>%
  mutate(cond = (col1 == "Blue" & col4 == "Green") |
           (col1 == "Blue" & col5 == "Green") |
           (col2 == "Blue" & col5 == "Green") |
           (col1 == "Green" & col4 == "Blue") |
           (col1 == "Green" & col5 == "Blue") |
           (col2 == "Green" & col5 == "Blue")) %>%
  filter(cond)

# Condition 3: Yellow not in the last position
cond_3 <- cond_2 %>%
  filter(col5 != "Yellow")

Output:

    col1  col2   col3   col4   col5 cond
1    Red  Blue Orange Yellow  Green TRUE
2    Red  Blue Yellow Orange  Green TRUE
3    Red Green Yellow Orange   Blue TRUE
4    Red Green Orange Yellow   Blue TRUE
5  Green   Red Orange Yellow   Blue TRUE
6  Green   Red Yellow Orange   Blue TRUE
7  Green   Red Yellow   Blue Orange TRUE
8   Blue   Red Yellow  Green Orange TRUE
9   Blue   Red Yellow Orange  Green TRUE
10  Blue   Red Orange Yellow  Green TRUE

Answer 5

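Maybe I misunderstood the question, but as far as I can see, none of the answers seems to show a solution in which there are 2 columns between the colors, as asked for in the second condition of the question.

I took the liberty of testing the data and found that you can only find a filter that meets your requirements by using "Yellow" and "Orange" (as far as I can tell).

This is not a general answer, and it is not actually correct either, because "Yellow" ends up in the last position, which violates the rule, but:

With the last position already accounted for, a distance of 2 between the colors reduces the problem to a 4-column problem. So a distance of 2 can only be achieved between column 1 and column 4. This leads to 4 assumptions:

Column 1 must be "Green" or "Blue"
Column 2 must be "Red"
Column 3 must not be "Green" or "Blue"
Column 4 must again be "Green" or "Blue", whichever is not in column 1

Here is the code I came up with. It isn't pretty and, as explained, "Green" and "Blue" are swapped for "Yellow" and "Orange", but I think it works.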

library(combinat)
library(tidyverse)

my_list = c("Red", "Blue", "Green", "Yellow", "Orange")

d = permn(my_list)

all_combinations  = as.data.frame(matrix(unlist(d), ncol = 5)) %>%
  setNames(paste0("col", 1:5))

`%!in%` <- Negate(`%in%`)

combis <- all_combinations %>% 
  filter(col1 %in% c("Yellow", "Orange"), 
         col2 == "Red", 
         !col3 %in% c("Yellow", "Orange"), 
         col5 == "Yellow") 

results <- vector()
for(i in seq_along(combis[,1])){
  
  if(combis[i,][1] %!in% c(combis[i,][4], "Red", "Green", "Blue")){
    results <- combis[i,] 
  }
}

results

    col1 col2  col3   col4   col5
3 Yellow  Red Green Orange Yellow

Answer 6

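I think that if you consider the position of each ball, rather than the ball at each position, you may find it easier to apply the three conditions, as in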

library(combinat)
my_list = c("Red", "Blue", "Green", "Yellow", "Orange")
d = permn(1:5)
md = matrix(unlist(d), ncol=5, byrow=TRUE)
colnames(md) = my_list

ok = md[, "Red"] <= 2 & 
     abs(md[, "Blue"] - md[, "Green"]) > 2 & 
     md[, "Yellow"] != 5
sum(ok)
# 10
md[ok, ] 
#      Red Blue Green Yellow Orange
# [1,]   1    2     5      3      4
# [2,]   1    5     2      3      4
# [3,]   1    5     2      4      3
# [4,]   1    2     5      4      3
# [5,]   2    4     1      3      5
# [6,]   2    1     4      3      5
# [7,]   2    1     5      4      3
# [8,]   2    5     1      4      3
# [9,]   2    5     1      3      4
#[10,]   2    1     5      3      4
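
The matrix above stores positions per ball. If the left-to-right ordering of colors is wanted instead, a small sketch is to invert each row with order(); the name arrangements and the pos1..pos5 column names are illustrative, not part of the answer. For example, the row Red = 1, Blue = 2, Green = 5, Yellow = 3, Orange = 4 becomes Red, Blue, Yellow, Orange, Green.

```r
library(combinat)

my_list <- c("Red", "Blue", "Green", "Yellow", "Orange")

# Each row gives the position (1-5) of every ball, as in the answer above
md <- matrix(unlist(permn(1:5)), ncol = 5, byrow = TRUE)
colnames(md) <- my_list

ok <- md[, "Red"] <= 2 &
  abs(md[, "Blue"] - md[, "Green"]) > 2 &
  md[, "Yellow"] != 5

# order(p) returns the ball indices sorted by position, so
# my_list[order(p)] is the left-to-right sequence of colors
arrangements <- t(apply(md[ok, , drop = FALSE], 1, function(p) my_list[order(p)]))
colnames(arrangements) <- paste0("pos", 1:5)

arrangements
```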

if-statement

r

rows

data-manipulation