Assign Number to Column Based on Index of Next Value in 2nd Column

I have the following data frame:

df <- data.frame(t = c("h","h","h","a","a","h","a","a","h","a","h","a","a"), time = c(1,1,1,1,1,1,1,1,1,1,1,2,2), 
                 key = c("no", "no", "no","yes","no","no","no","no","yes","yes","no","no","no"), 
                 expected = c(-1,-1,-1,1,-1,1,-1,-1,1,1,0,0,0))

   t time key expected myTest1 myTest2 myTest3
1  h    1  no       -1      -1       1       1
2  h    1  no       -1      -1       1       1
3  h    1  no       -1      -1       1       1
4  a    1 yes        1       1       1      -1
5  a    1  no       -1       1       0      -1
6  h    1  no        1      -1       0       1
7  a    1  no       -1       1       0      -1
8  a    1  no       -1       1       0      -1
9  h    1 yes        1      -1       0       1
10 a    1 yes        1       1       1      -1
11 h    1  no        0      -1       0       1
12 a    2  no        0       1       0      -1
13 a    2  no        0       1       0      -1

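I am trying to recreate a column like expected. Grouping by the time column, the first condition is to assign 1 to every row that has "yes" in key. The other conditions are:

  1. If the row containing the next "yes" in key has "h" in t, assign 1 to every row with "h" and -1 to every row with "a", up to that "yes" row
  2. If the row containing the next "yes" in key has "a" in t, assign 1 to every row with "a" and -1 to every row with "h", up to that "yes" row
  3. If there are no more "yes" rows within the current time section, assign 0 to the row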

I first tried a nested for loop:

df$myTest1 <- 0
testIdx <- which(df$key %in% "yes")
df$myTest1[testIdx] <- 1
for (i in 1:length(testIdx)) {
  for (j in 1:nrow(df)) {
    df$myTest1[j] <- ifelse(df$t[testIdx[i]] == "h" & df$t[j] == "h", 1,
                                  ifelse(df$t[testIdx[i]] == "h" & df$t[j] == "a", -1,
                                         ifelse(df$t[testIdx[i]] == "a" & df$t[j] == "h",
                                                -1, ifelse(df$t[testIdx[i]] == "a" & 
                                                             df$t[j] == "a", 1, 0))))
  }
}

This gives the correct values in myTest1 up to and including the first "yes", but every row after that is wrong.

I also tried two other approaches, which create myTest2 and myTest3:

df$myTest2 <- cumsum(c(1, head(df$key == "yes", -1))) %% 2

library(dplyr)

df <- df %>%
  mutate(myTest3 = case_when(t == "h" ~ 1,   # still needs the 'next "yes" is also "h"' condition
                             t == "a" ~ -1,
                             TRUE ~ 0))

Using case_when() is similar to ifelse, but I do not know how to add the other conditions without a for loop.

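To clarify, the expected column reads the way it does because the first "yes" belongs to a row with "a", so all preceding "h" rows get -1 and all preceding "a" rows get 1. The next "yes" row has "h", so the rows between the two "yes" rows get 1 for "h" and -1 for "a". Row 10 contains a "yes" and comes immediately after another "yes", so it simply gets 1. Row 11 is the last row with time = 1 and has no "yes" after it, so it is assigned 0. There are no "yes" rows when time = 2, so all rows there receive 0 as well.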

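This might help you.

The magic happens in the na.locf function from the zoo package, called with fromLast = TRUE so that each row picks up the t value of the next "yes" row.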
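
As a quick illustration of what na.locf does here, the following is a minimal sketch on a toy vector (x is a made-up example, not taken from your data). With fromLast = TRUE each NA is filled from the next non-NA value, and with the default na.rm = TRUE a trailing NA that has nothing after it is dropped, so the result can be shorter than the input.

```r
library(zoo)

# Hypothetical toy vector: NA marks rows without "yes", letters mark "yes" rows
x <- c(NA, "a", NA, NA, "h", NA)

# Fill each NA from the NEXT non-NA value (next observation carried backward)
filled <- na.locf(x, fromLast = TRUE)
filled          # "a" "a" "h" "h" "h"  (the trailing NA has been dropped)
length(filled)  # 5, one shorter than x

# Keep the trailing NA instead, so the length matches the input
filled_keep <- na.locf(x, fromLast = TRUE, na.rm = FALSE)
```

That shortening is why the doblock function in this answer only writes to the first length(tomatch) rows and leaves the remaining rows at their default of 0.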

library(magrittr)
library(zoo)

doblock <- function(timeblock) {
  yesrows <- which(timeblock$key == "yes")
  if (length(yesrows) == 0) {
    # no yes rows in timeblock: make all 0
    timeblock$exp2 <- 0
  } else {
    # create a vector of a's and h's against which we need to match the t field
    tomatch <- rep(NA, nrow(timeblock))
    tomatch[yesrows] <- as.character(timeblock$t)[yesrows]
    
    tomatch <- zoo::na.locf(tomatch, fromLast = TRUE)
    
    # now do the matching
    timeblock$exp2 <- 0    # set default as 0 (for those entries after the last 'yes')
    timeblock$exp2[1:length(tomatch)] <-
      mapply(function(t1, t2) {
        if ((t1) == t2) 1 else -1
      }, as.character(timeblock$t[1:length(tomatch)]), tomatch)
  }
  
  timeblock
} 

# split dataframe into blocks for each 'time' and apply function to every time-block
newdf <- 
  lapply(split(df, df$time), doblock) %>% 
  do.call(rbind, .)

The result is shown below, where exp2 is the output of the function above. It matches your expected field.

     t time key expected exp2
1.1  h    1  no       -1   -1
1.2  h    1  no       -1   -1
1.3  h    1  no       -1   -1
1.4  a    1 yes        1    1
1.5  a    1  no       -1   -1
1.6  h    1  no        1    1
1.7  a    1  no       -1   -1
1.8  a    1  no       -1   -1
1.9  h    1 yes        1    1
1.10 a    1 yes        1    1
1.11 h    1  no        0    0
2.12 a    2  no        0    0
2.13 a    2  no        0    0
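
If you would rather stay with dplyr, as in your myTest3 attempt, the same fill-backward idea can probably be expressed with tidyr::fill. This is only a sketch under some assumptions: dplyr and tidyr are installed, and the helper column next_t and the output column exp3 are names made up for this example.

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  t = c("h","h","h","a","a","h","a","a","h","a","h","a","a"),
  time = c(1,1,1,1,1,1,1,1,1,1,1,2,2),
  key = c("no","no","no","yes","no","no","no","no","yes","yes","no","no","no"),
  expected = c(-1,-1,-1,1,-1,1,-1,-1,1,1,0,0,0)
)

res <- df %>%
  group_by(time) %>%
  # next_t (hypothetical helper): t of the "yes" row itself, NA everywhere else
  mutate(next_t = if_else(key == "yes", as.character(t), NA_character_)) %>%
  # fill upwards within each time group, so each row sees the next "yes" row's t
  fill(next_t, .direction = "up") %>%
  mutate(exp3 = case_when(
    is.na(next_t)              ~ 0,   # no further "yes" in this time group
    as.character(t) == next_t  ~ 1,   # same letter as the next "yes" row
    TRUE                       ~ -1   # different letter
  )) %>%
  ungroup()

all(res$exp3 == res$expected)
```

Because fill works within the groups set by group_by, rows after the last "yes" of a time group keep NA in next_t and fall through to 0, which avoids the manual split and rbind step.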