Defining indices for row sequences more succinctly

Question

I have a data frame like this:

set.seed(123)
df <- data.frame(A = sample(LETTERS[1:5], 50, replace = TRUE), 
                 B = sample(LETTERS[1:5], 50, replace = TRUE))

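I want to filter the data frame on two criteria: (i) target rows that meet a specific condition and (ii) a specific number of rows before each target row. Specifically, I want to keep the rows where A == "A" & B == "A" together with the 5 rows before each of them. I can do this in two steps: first define a function, then use that function as the input to slice: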

Sequ <- function(col1, col2) {
  # get row indices of target row with function `which`
  inds <- which(col1 == "A" & col2 == "A") 
  # sort row indices of the rows before target row AND target row itself
  sort(unique(c(inds-5, inds-4, inds-3,inds-2, inds-1, inds)))
}

library(dplyr)
df %>%
  slice(Sequ(col1 = A, col2 = B))
   A B
1  D C
2  D B
3  C B
4  C D
5  B B
6  A A
7  E B
8  E D
9  D C
10 D D
11 A A
12 C C
13 D E
14 B E
15 B E
16 B A
17 A A
18 C D
19 C B
20 B D
21 A B
22 A A

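But there must be a more efficient alternative to this part: sort(unique(c(inds-5, inds-4, inds-3,inds-2, inds-1, inds))). If I want to keep not just the 5 preceding rows but 10 or 100, spelling out each index offset separately quickly becomes impractical. How can this part be coded more economically?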

Answer 1

1) Define a function bothA that takes a matrix and returns TRUE if any of its rows consists entirely of "A". Then use rollapply to apply it over a moving window.

library(zoo)

bothA <- function(x) any(rowSums(rbind(x) == "A") == 2)
ok <- rollapply(df, 6, bothA, align = "left", partial = TRUE, by.column = FALSE)
df[ok, ]

2) Or in a pipeline:

df %>% 
  filter(rollapply(., 6, bothA, align = "left", partial = TRUE, by.column = FALSE))

3) This also works:

ok <- rollapply(rowSums(df == "A") == 2, 6, any, align = "left", partial = TRUE)
df[ok, ]
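
Since the window width above is fixed at 6 (the target row plus 5 preceding rows), one way to handle 10 or 100 preceding rows is to pass the count in as a parameter. The following is only a sketch: keep_with_lead and its arguments n and value are hypothetical names, not part of zoo or of the answer above, and it assumes every column of the data frame must equal value for a row to count as a target.

```r
library(zoo)

# Hypothetical wrapper: n is the number of rows to keep before each target row.
keep_with_lead <- function(data, n, value = "A") {
  # TRUE for rows in which every column equals `value`
  target <- rowSums(data == value) == ncol(data)
  # A left-aligned window of n + 1 rows starting at row i covers rows i..(i + n),
  # so row i is kept if a target occurs at most n rows after it.
  ok <- rollapply(target, n + 1, any, align = "left", partial = TRUE)
  data[ok, , drop = FALSE]
}

set.seed(123)
df <- data.frame(A = sample(LETTERS[1:5], 50, replace = TRUE),
                 B = sample(LETTERS[1:5], 50, replace = TRUE))

keep_with_lead(df, n = 5)   # same window as the width-6 calls above
keep_with_lead(df, n = 10)  # widen the lead without listing offsets
```

Changing n is then the only edit needed to widen or narrow the window.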

Answer 2

Here is a dplyr solution that can be used directly in a pipe, without filter.

Sequ <- function(x, col1, col2, value = "A"){
  x %>%
    mutate(grp = lag(cumsum({{col1}} == value & {{col2}} == value), default = 0)) %>%
    group_by(grp) %>%
    slice_tail(n = 5) %>%
    ungroup() %>%
    select(-grp)
}

df %>% Sequ(A, B)
## A tibble: 23 x 2
#   A     B    
#   <chr> <chr>
# 1 B     D    
# 2 C     C    
# 3 E     A    
# 4 D     B    
# 5 A     A    
# 6 C     D    
# 7 E     E    
# 8 C     E    
# 9 C     C    
#10 A     A    
## … with 13 more rows
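
To see how the grouping inside Sequ behaves, it may help to run the lag(cumsum(...)) step on a small logical vector. This is an illustrative sketch only; the vector target is made up for the example.

```r
library(dplyr)

# Made-up example: TRUE marks a target row.
target <- c(FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, TRUE, FALSE)

# cumsum() increases at each target row; lag() shifts the counter down by one,
# so each target row becomes the last row of its group.
grp <- lag(cumsum(target), default = 0)
grp
# 0 0 0 1 1 1 1 2
```

With this grouping, slice_tail(n = k) keeps the target row plus at most k - 1 rows before it, and the rows after the last target (group 2 here) form a group that contains no target row.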

Answer 3

A dplyr and purrr solution could be:

library(purrr)

df %>%
  filter(row_number() %in% unlist(map(which(A == "A" & B == "A"), ~ (.x - 5):.x)))
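
The same index set can also be built in base R without listing each offset. This is a minimal sketch: rows_with_lead is a hypothetical helper name, and dropping indices below 1 is an added assumption to cover targets that fall within the first n rows.

```r
# Hypothetical helper: indices of every target row plus the n rows before it.
rows_with_lead <- function(target, n) {
  inds <- which(target)
  # outer() computes inds[i] - k for every target i and every offset k in 0..n
  idx <- as.vector(outer(inds, 0:n, "-"))
  # Drop indices that would fall before the first row, then sort and
  # deduplicate indices shared by overlapping windows.
  sort(unique(idx[idx >= 1]))
}

set.seed(123)
df <- data.frame(A = sample(LETTERS[1:5], 50, replace = TRUE),
                 B = sample(LETTERS[1:5], 50, replace = TRUE))

keep <- rows_with_lead(df$A == "A" & df$B == "A", n = 5)
df[keep, ]
```

The result of rows_with_lead can also be passed to slice in place of the hand-written c(inds-5, ..., inds).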

Tags: r, indices, dplyr