Obtain nested values from a data frame

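I am trying to obtain the maximum value in the event column up to the point where an agreement (dummy) is reached; events are nested in agreements, and agreements are nested in dyads, which run over year. Note that the years are not always consecutive, meaning there are gaps between years (1986, 1987, 2001, 2002).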

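I was able to get the maximum value within a dyad using ddply and max(event), but I am struggling to "assign" the different events to the correct agreement (until/after). I am basically missing an 'identifier' that assigns each observation to an agreement.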

The result I am looking for is shown in the "result" column.

dyad    year    event   agreement   agreement.name  result  
  1     1985    9           
  1     1986    4       1           agreement1       9 
  1     1987    
  1     2001    3       
  1     2002            1           agreement2       3
  2     1999    1       
  2     2000    5            
  2     2001            1           agreement3       5 
  2     2002    2       
  2     2003                
  2     2004            1           agreement 4      2

Here is the data in a format that is hopefully easier to use:

df<-structure(list(dyad = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 
2L), year = c(1985L, 1986L, 1987L, 2001L, 2002L, 1999L, 2000L, 
2001L, 2002L, 2003L, 2004L), event = c(9L, 4L, NA, 3L, NA, 1L, 
5L, NA, 2L, NA, NA), agreement = c(NA, 1L, NA, NA, 1L, NA, NA, 
1L, NA, NA, 1L), agreement.name = c("", "agreement1", "", "", 
"agreement2", "", "", "agreement3", "", "", "agreement 4"), result = c(NA, 
9L, NA, NA, 3L, NA, NA, 5L, NA, NA, 2L)), .Names = c("dyad", 
"year", "event", "agreement", "agreement.name", "result"), class = "data.frame", row.names = c(NA, 
-11L))

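Here is an option using data.table. Convert the 'data.frame' to a 'data.table' (setDT(df)), then create another grouping variable ('ind') based on the non-empty elements in 'agreement.name'. Grouped by the 'dyad' and 'ind' columns, we create a new column 'result', using ifelse to fill the rows where 'agreement.name' is non-empty with the max of 'event'.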

library(data.table)
setDT(df)[, ind:=cumsum(c(TRUE,diff(agreement.name=='')>0)),dyad][,
    result:=ifelse(agreement.name!='', max(event, na.rm=TRUE), NA) ,
                list(dyad, ind)][, ind:=NULL][]
#       dyad year event agreement agreement.name result
# 1:    1 1985     9        NA                    NA
# 2:    1 1986     4         1     agreement1      9
# 3:    1 1987    NA        NA                    NA
# 4:    1 2001     3        NA                    NA
# 5:    1 2002    NA         1     agreement2      3
# 6:    2 1999     1        NA                    NA
# 7:    2 2000     5        NA                    NA
# 8:    2 2001    NA         1     agreement3      5
# 9:    2 2002     2        NA                    NA
#10:    2 2003    NA        NA                    NA
#11:    2 2004    NA         1    agreement 4      2
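To see what the grouping step above does, it can help to keep 'ind' instead of dropping it. The following is only a sketch on a trimmed copy of the question's data (the name dt is made up for this illustration): diff(agreement.name == "") > 0 is TRUE where a row without an agreement directly follows a row with one, so the cumulative sum starts a new block right after each agreement.

```r
library(data.table)

# Trimmed copy of the question's data (dt is a name chosen for this sketch)
dt <- data.table(
  dyad  = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L),
  year  = c(1985L, 1986L, 1987L, 2001L, 2002L,
            1999L, 2000L, 2001L, 2002L, 2003L, 2004L),
  event = c(9L, 4L, NA, 3L, NA, 1L, 5L, NA, 2L, NA, NA),
  agreement.name = c("", "agreement1", "", "", "agreement2",
                     "", "", "agreement3", "", "", "agreement 4")
)

# Within each dyad, open a new block on the first row after an agreement row.
# c(TRUE, ...) makes the first row of every dyad start block 1.
dt[, ind := cumsum(c(TRUE, diff(agreement.name == "") > 0)), by = dyad]

# Each (dyad, ind) pair now covers the rows leading up to one agreement
print(dt[, .(dyad, year, event, agreement.name, ind)])
```

Note that this relies on agreement rows being separated by at least one row without an agreement, which holds for the data shown here.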

Or we can use numeric indexing instead of ifelse:
setDT(df)[, result:=c(NA, max(event, na.rm=TRUE))[(agreement.name!='')+1L] ,
   list(ind= cumsum(c(TRUE,diff(agreement.name=='')>0)),dyad)][]
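The indexing trick can be looked at in isolation. This is a small base R sketch for a single group (the vectors events and has_agreement are made up to mirror the first block of dyad 1): adding 1L to a logical vector turns FALSE into 1 and TRUE into 2, which then selects either NA or the group maximum from a two-element vector.

```r
# Values for one (dyad, ind) group, mirroring years 1985-1986 of dyad 1
events        <- c(9L, 4L)
has_agreement <- c(FALSE, TRUE)   # same role as agreement.name != ''

grp_max <- max(events, na.rm = TRUE)

# FALSE + 1L is 1 and picks NA; TRUE + 1L is 2 and picks the group maximum
picked <- c(NA, grp_max)[has_agreement + 1L]
print(picked)
```

This gives the same values as the ifelse version for this group, without the vectorised test.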

Data

df <- structure(list(dyad = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 
2L), year = c(1985L, 1986L, 1987L, 2001L, 2002L, 1999L, 2000L, 
2001L, 2002L, 2003L, 2004L), event = c(9L, 4L, NA, 3L, NA, 1L, 
5L, NA, 2L, NA, NA), agreement = c(NA, 1L, NA, NA, 1L, NA, NA, 
1L, NA, NA, 1L), agreement.name = c("", "agreement1", "", "", 
"agreement2", "", "", "agreement3", "", "", "agreement 4")), 
.Names = c("dyad", 
"year", "event", "agreement", "agreement.name"), row.names = c(NA,
-11L), class = "data.frame")