Conditional merge, based on an event happening between two panel observations

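I have a panel dataset, Panel, and a dataset containing a list of events, Events. In the panel dataset, an equal panelID indicates that two observations belong together.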

panelID = c(1:50)   
year= c(2001:2010)
country = c("NLD", "GRC", "GBR")

n <- 2

library(data.table)
set.seed(123)
Panel <- data.table(panelID = rep(sample(panelID), each = n),
                 country = rep(sample(country, length(panelID), replace = T), each = n),
                 year = c(replicate(length(panelID), sample(year, n))),
                 some_NA = sample(0:5, 6),                                             
                 some_NA_factor = sample(0:5, 6),         
                 norm = round(runif(100)/10,2),
                 Income = round(rnorm(10,-5,5),2),
                 Happiness = sample(10,10),
                 Sex = round(rnorm(10,0.75,0.3),2),
                 Age = sample(100,100),
                 Educ = round(rnorm(10,0.75,0.3),2))        
Panel[, uniqueID := .I]                                                                        # Creates a unique ID     
Panel[Panel == 0] <- NA    


Events <- fread(
"Event_Type  country year   
A   NLD   2005
C   NLD   2004       
A   GBR   2006
B   GBR   2003   
A   GRC   2002             
D   GRC   2007",
header = TRUE)

============================================= =================================== EDIT:

Events <- fread(
"Event_Type  country year   
A   NLD   2005
A   NLD   2004       
A   GBR   2006
A   GBR   2003   
A   GRC   2002             
A   GRC   2007",
header = TRUE)

Desired outcome after the edit:

panelID country year 2002  2003  2004 2005 2006 2007 
1       NLD     2002 NA    NA    1    1    NA   NA 
1       NLD     2006 NA    NA    1    1    NA   NA 

============================================= ==========================

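I would like to add the values from the Event_Type column to Panel if the year of the event lies between the two panel observations (and the event is in the same country).

As an example, take the following panel observations:

panelID country year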
1       NLD     2002
1       NLD     2006

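Panel will get 4 additional columns, A to D. If an event, such as the NLD event in 2005 (the first row of Events), happens in either of the two years or between them, column A gets a 1 in both rows. The result would be as follows:

panelID country year A  B  C  D 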
1       NLD     2002 1  NA NA NA
1       NLD     2006 1  NA NA NA

I know how to merge on the same year:

# Events has no iso column, so cast on country + year (the merge keys)
merge(Panel, dcast(Events, country + year ~ Event_Type, fun.aggregate = length),
      by = c("country", "year"))

But how should I do the merge if I want the event year to be equal to or between the two years of a panelID?

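Here is a solution to your problem using data.table. The code could be shortened, but I always find it useful (especially on SO) to show all the intermediate steps, for easy error checking and verification: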

#first, summarise Panel, to get the time-span of the panelID
Panel.short <- Panel[, .(country = unique(country), 
                         start = min(year), 
                         end = max(year) ), 
                     by = .(panelID)]
#    panelID country start  end
# 1:       1     NLD  2002 2006

#perform left non-equi join
Panel.short.joined <- Events[ Panel.short, on =.(country, year >= start, year <= end), mult = "all"][]
#    Event_Type country year year.1 panelID
# 1:          A     NLD 2002   2006       1
# 2:          C     NLD 2002   2006       1

#cast to wide
Panel.final <- dcast( Panel.short.joined, 
       panelID + country ~ Event_Type, 
       fun.aggregate = length )
#    panelID country A C
# 1:       1     NLD 1 1

#perform update join on the original Panel
Panel[, `:=`(A=0, B=0, C=0, D=0)][ 
  Panel.final, 
  `:=`( A = i.A, C = i.C),   # <- add B = i.B and D = i.D here 
  on = .( panelID )][]
#    panelID country year A B C D
# 1:       1     NLD 2002 1 0 1 0
# 2:       1     NLD 2006 1 0 1 0
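
If the event types should not be hard-coded in the update join, one possible variation is to loop over the types found in Events. This is only a sketch on hypothetical toy data that mirrors the example rows; the names spans, hits and types are made up for illustration, and it writes a plain 0/1 indicator rather than a count.

```r
library(data.table)

# Hypothetical toy inputs mirroring the example rows
Panel <- data.table(panelID = c(1L, 1L, 2L, 2L),
                    country = c("NLD", "NLD", "GBR", "GBR"),
                    year    = c(2002L, 2006L, 2004L, 2005L))
Events <- data.table(Event_Type = c("A", "C", "A", "B"),
                     country    = c("NLD", "NLD", "GBR", "GBR"),
                     year       = c(2005L, 2004L, 2006L, 2003L))

# Time span covered by each panelID
spans <- Panel[, .(start = min(year), end = max(year)),
               by = .(panelID, country)]

# Inner non-equi join: one row per event that falls inside a panelID's span
hits <- Events[spans, on = .(country, year >= start, year <= end),
               .(panelID, Event_Type), nomatch = NULL]

# One indicator column per event type present in Events
types <- sort(unique(Events$Event_Type))
for (tp in types) {
  ids <- hits[Event_Type == tp, unique(panelID)]
  Panel[, (tp) := as.integer(panelID %in% ids)]
}

Panel
```

Because every type in Events gets its own column, types that never fall inside any span still show up, filled with 0.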

This is similar to @Wimpel's answer, but changes the order to:

  1. Cast Events to wide
  2. Add the year range of each panelID by reference
  3. Non-equi update join

# cast Event 
Events_cast <- dcast(Events, country + year~Event_Type, length)

# update by reference for join later
Panel[, `:=`(start = min(year), end = max(year)), by = panelID]

# dcast sorts the rhs alphabetically
cols <- sort(unique(Events[['Event_Type']]))

# non-equi update join
Panel[Events_cast,
      on = .(country,
             start <= year,
             end >= year),
      (cols) := mget(cols)]

#clean up data frame
setnafill(Panel, fill = 0L, cols = cols)
Panel[, `:=`(start = NULL, end = NULL)]

Panel

I would consider using 'between' and '.SD'. I could not follow your example, so in general:

DT[between(year, startYear, endYear, incbounds = FALSE)][, dcast(.SD, cat1 ~ cat2 ...)]

Note: by passing the data.table to dcast via .SD, you can subset further using i.
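
As a rough, runnable illustration of that pattern, here is a minimal sketch on a hypothetical table in which each event row already carries the span of its panelID (the column names startYear and endYear and the toy values are assumptions). It uses incbounds = TRUE because the question asks for "equal to or between"; switch it to FALSE to exclude the boundary years.

```r
library(data.table)

# Hypothetical toy table: each event paired with the span of a panelID
DT <- data.table(panelID    = c(1L, 1L, 2L),
                 Event_Type = c("A", "C", "B"),
                 year       = c(2005L, 2004L, 2003L),
                 startYear  = c(2002L, 2002L, 2004L),
                 endYear    = c(2006L, 2006L, 2005L))

# Keep the events whose year lies inside the span, then cast the
# remaining rows to wide, counting events per panelID and type
res <- DT[between(year, startYear, endYear, incbounds = TRUE)][
  , dcast(.SD, panelID ~ Event_Type,
          fun.aggregate = length, value.var = "Event_Type")]

res
```

The event for panelID 2 falls outside its span, so it is dropped before the cast and no B column is created.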