Checking and reviewing previous grouped values within groups in R

Hi everyone, I hope you have had a good week.

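I have a small dataset with 4 variables: subject; key, a code that a subject uses to log into a system; order, which keeps the records in chronological order; and period, which indicates whether the key was used in an earlier period (past) or in the current month (current).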

Here is the dataset:

subjects<-c(rep("James",3),
            rep("Alex",2),
            rep("Mila",8),
            rep("Mark",1))

keys<-c(rep("IX08-8",2),"IX08-8",
        "UX-007","HH-011",rep("PO_85",7),"UJ_8","785_PO")
order<-c(1:14)
period<-c("past","past","current","past","current",rep("past",6),"current","current","current")
df<-cbind(subjects,keys,period,order)  

> head(df)
     subjects keys     period    order
[1,] "James"  "IX08-8" "past"    "1"  
[2,] "James"  "IX08-8" "past"    "2"  
[3,] "James"  "IX08-8" "current" "3"  
[4,] "Alex"   "UX-007" "past"    "4"  
[5,] "Alex"   "HH-011" "current" "5"  
[6,] "Mila"   "PO_85"  "past"    "6" 

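Ultimately I need to be able to tell whether a subject logged into the system during the current period with a previously used key. If the subject used a new key to log in during the current period, I assign the value "1" to a column named result; if the subject used a previously used key to log in during the current period, the assigned value should be "0"; otherwise it should be NA.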

The output I want looks like this:

      subjects keys     period    order result
 [1,] "James"  "IX08-8" "past"    "1"   NA    
 [2,] "James"  "IX08-8" "past"    "2"   NA    
 [3,] "James"  "IX08-8" "current" "3"   "0"   
 [4,] "Alex"   "UX-007" "past"    "4"   NA    
 [5,] "Alex"   "HH-011" "current" "5"   "1"   
 [6,] "Mila"   "PO_85"  "past"    "6"   NA    
 [7,] "Mila"   "PO_85"  "past"    "7"   NA    
 [8,] "Mila"   "PO_85"  "past"    "8"   NA    
 [9,] "Mila"   "PO_85"  "past"    "9"   NA    
[10,] "Mila"   "PO_85"  "past"    "10"  NA    
[11,] "Mila"   "PO_85"  "past"    "11"  NA    
[12,] "Mila"   "PO_85"  "current" "12"  "0"   
[13,] "Mila"   "UJ_8"   "current" "13"  "1"   
[14,] "Mark"   "785_PO" "current" "14"  "1"

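For example, in row 3 James is assigned 0 in result because in the current month he logged in with a previously used key, "IX08-8". Mark has 1 in the result column because the system tracked only one key for him, and that happens to be the key he used to log in during the current period, so technically it is a "new key".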

What have I done to solve this?

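I can group the dataset by subject and make sure it is arranged in descending order by order, but the only approach I can think of is to build a vector of keys (vector.of.previous.keys) per subject from the rows where period == "past", and then test whether the current key is %in% vector.of.previous.keys. If there is a way to check this criterion directly within groups, that would be more efficient. Thanks a lot for your help.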

Assuming your data is stored in a data.frame

df <- data.frame(subjects,keys,period,order)

you could use

library(dplyr)

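df %>% 
  # number each key's occurrences per subject, in row (chronological) order
  group_by(subjects, keys) %>% 
  mutate(count = row_number()) %>% 
  ungroup() %>% 
  # a current-period key is new (1) on its first occurrence, reused (0) otherwise
  mutate(result = case_when(period == "current" & count == 1 ~ 1,
                            period == "current" & count > 1 ~ 0,
                            TRUE ~ NA_real_)) %>% 
  select(-count)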

to get

# A tibble: 14 x 5
   subjects keys   period  order result
   <chr>    <chr>  <chr>   <int>  <dbl>
 1 James    IX08-8 past        1     NA
 2 James    IX08-8 past        2     NA
 3 James    IX08-8 current     3      0
 4 Alex     UX-007 past        4     NA
 5 Alex     HH-011 current     5      1
 6 Mila     PO_85  past        6     NA
 7 Mila     PO_85  past        7     NA
 8 Mila     PO_85  past        8     NA
 9 Mila     PO_85  past        9     NA
10 Mila     PO_85  past       10     NA
11 Mila     PO_85  past       11     NA
12 Mila     PO_85  current    12      0
13 Mila     UJ_8   current    13      1
14 Mark     785_PO current    14      1
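In the code above, count == 1 marks the first row in which a subject uses a given key. The same flag can be sketched in base R with duplicated(). This is a minimal sketch, not the answer's method: it assumes the rows are already sorted chronologically and that "previously used" means "seen in any earlier row for that subject". The helper name seen_before is made up for illustration.

```r
# Same data as in the question
subjects <- c(rep("James", 3), rep("Alex", 2), rep("Mila", 8), rep("Mark", 1))
keys <- c(rep("IX08-8", 3), "UX-007", "HH-011", rep("PO_85", 7), "UJ_8", "785_PO")
period <- c("past", "past", "current", "past", "current", rep("past", 6), rep("current", 3))
order <- 1:14
df <- data.frame(subjects, keys, period, order, stringsAsFactors = FALSE)

# TRUE when the same subject/key pair already appeared in an earlier row
# (assumes rows are in chronological order)
seen_before <- duplicated(df[c("subjects", "keys")])

# current rows: 1 for a new key, 0 for a reused key; past rows: NA
df$result <- ifelse(df$period == "current", as.integer(!seen_before), NA_integer_)
df$result
# NA NA 0 NA 1 NA NA NA NA NA NA 0 1 1
```

Because duplicated() works on the whole data frame at once, no explicit grouping step is needed here.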

Another dplyr approach, with all your conditions encoded in a case_when statement.

Code

library(dplyr)

df %>% 
  group_by(subjects) %>% 
  # note: only the subject's first key is compared, which suffices for this data
  mutate(result = case_when(period == "current" & n() == 1 ~ "1",
                            period == "current" & keys == first(keys) ~ "0",
                            period == "current" & keys != first(keys) & n() > 1 ~ "1",
                            TRUE ~ NA_character_))
# A tibble: 14 × 5
# Groups:   subjects [4]
   subjects keys   period  order result
   <chr>    <chr>  <chr>   <int> <chr> 
 1 James    IX08-8 past        1 NA    
 2 James    IX08-8 past        2 NA    
 3 James    IX08-8 current     3 0     
 4 Alex     UX-007 past        4 NA    
 5 Alex     HH-011 current     5 1     
 6 Mila     PO_85  past        6 NA    
 7 Mila     PO_85  past        7 NA    
 8 Mila     PO_85  past        8 NA    
 9 Mila     PO_85  past        9 NA    
10 Mila     PO_85  past       10 NA    
11 Mila     PO_85  past       11 NA    
12 Mila     PO_85  current    12 0     
13 Mila     UJ_8   current    13 1     
14 Mark     785_PO current    14 1    
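Comparing against first(keys) gives the desired output here because every subject in this data has at most one distinct past key. If a subject could have several past keys, the check described in the question (is the current key %in% the subject's past keys) may be the safer one. A minimal base R sketch of that idea follows; the loop and the names rows, past_keys and current are illustrative assumptions, not part of either answer.

```r
# Same data as in the question
subjects <- c(rep("James", 3), rep("Alex", 2), rep("Mila", 8), rep("Mark", 1))
keys <- c(rep("IX08-8", 3), "UX-007", "HH-011", rep("PO_85", 7), "UJ_8", "785_PO")
period <- c("past", "past", "current", "past", "current", rep("past", 6), rep("current", 3))
order <- 1:14
df <- data.frame(subjects, keys, period, order, stringsAsFactors = FALSE)

df$result <- NA_integer_  # past rows stay NA
for (s in unique(df$subjects)) {
  rows <- df$subjects == s
  # every key this subject used in a past period
  past_keys <- df$keys[rows & df$period == "past"]
  current <- rows & df$period == "current"
  # 0 if the current key is among the past keys, 1 if it is new
  df$result[current] <- as.integer(!(df$keys[current] %in% past_keys))
}
df$result
# NA NA 0 NA 1 NA NA NA NA NA NA 0 1 1
```

Unlike a row-order based check, this version does not depend on how the rows are sorted, since it only looks at the period label.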

Data

Note that I have changed your cbind() to data.frame() (a data frame is easier to work with than a matrix).

subjects<-c(rep("James",3),
            rep("Alex",2),
            rep("Mila",8),
            rep("Mark",1))

keys<-c(rep("IX08-8",2),"IX08-8",
        "UX-007","HH-011",rep("PO_85",7),"UJ_8","785_PO")
order<-c(1:14)
period<-c("past","past","current","past","current",rep("past",6),"current","current","current")
df<-data.frame(subjects,keys,period,order)
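To illustrate why the switch from cbind() to data.frame() matters, here is a small sketch using the same vectors. The names m and df2 are made up for the example.

```r
subjects <- c(rep("James", 3), rep("Alex", 2), rep("Mila", 8), rep("Mark", 1))
keys <- c(rep("IX08-8", 3), "UX-007", "HH-011", rep("PO_85", 7), "UJ_8", "785_PO")
period <- c("past", "past", "current", "past", "current", rep("past", 6), rep("current", 3))
order <- 1:14

# cbind() on atomic vectors builds a matrix, and a matrix holds a single type,
# so the integer order column is coerced to character
m <- cbind(subjects, keys, period, order)
is.matrix(m)            # TRUE
is.character(m)         # TRUE
m[1, "order"]           # "1" (a string, not a number)

# data.frame() keeps each column's own type
df2 <- data.frame(subjects, keys, period, order)
is.integer(df2$order)   # TRUE
```

This is also why the quoted values in the question's head(df) output all appear as strings.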