Conditional merge, based on an event happening between two panel observations
I have a panel dataset, Panel, and a dataset with a list of events, Events. In the panel dataset, an equal panelID indicates that two observations belong together.
panelID = c(1:50)
year= c(2001:2010)
country = c("NLD", "GRC", "GBR")
n <- 2
library(data.table)
set.seed(123)
Panel <- data.table(panelID = rep(sample(panelID), each = n),
                    country = rep(sample(country, length(panelID), replace = TRUE), each = n),
                    year = c(replicate(length(panelID), sample(year, n))),
                    some_NA = sample(0:5, 6),
                    some_NA_factor = sample(0:5, 6),
                    norm = round(runif(100)/10,2),
                    Income = round(rnorm(10,-5,5),2),
                    Happiness = sample(10,10),
                    Sex = round(rnorm(10,0.75,0.3),2),
                    Age = sample(100,100),
                    Educ = round(rnorm(10,0.75,0.3),2))
Panel[, uniqueID := .I] # Creates a unique ID
Panel[Panel == 0] <- NA
Events <- fread(
"Event_Type country year
A NLD 2005
C NLD 2004
A GBR 2006
B GBR 2003
A GRC 2002
D GRC 2007",
header = TRUE)
================================================================================
EDIT:
Events <- fread(
"Event_Type country year
A NLD 2005
A NLD 2004
A GBR 2006
A GBR 2003
A GRC 2002
A GRC 2007",
header = TRUE)
Expected result after the edit:
panelID country year 2002 2003 2004 2005 2006 2007
1       NLD     2002 NA   NA   1    1    NA   NA
1       NLD     2006 NA   NA   1    1    NA   NA
================================================================================
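I would like to add the values of the Event_Type column to Panel if the year of the event lies between the two panel observations (and the event is in the same country).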
As an example, take the following panel observations:
panelID country year
1       NLD     2002
1       NLD     2006
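Panel will get 4 extra columns, A to D. If the 2005 event in country NLD (the first row of Events) happens in either of the two years, or between them, then column A gets a 1 in both years. Since it does, the result is as follows:
panelID country year A B  C  D
1       NLD     2002 1 NA NA NA
1       NLD     2006 1 NA NA NA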
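I know that merging on the same year looks like this:
# cast the events wide by country and year (Events has no iso column)
merge(Panel, dcast(Events, country + year ~ Event_Type, fun.aggregate = length),
      by = c("country", "year"))
But how should I merge if I want the event year to be equal to, or between, the two years of a panelID?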
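Here is a solution to your problem using data.table.
The code could be shortened, but I always find it useful (especially on SO) to show all the intermediate steps, for easier error checking and verification.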
#first, summarise Panel, to get the time-span of the panelID
Panel.short <- Panel[, .(country = unique(country),
                         start = min(year),
                         end = max(year) ),
                     by = .(panelID)]
# panelID country start end
# 1: 1 NLD 2002 2006
#perform left non-equi join
Panel.short.joined <- Events[ Panel.short, on =.(country, year >= start, year <= end), mult = "all"][]
# Event_Type country year year.1 panelID
# 1: A NLD 2002 2006 1
# 2: C NLD 2002 2006 1
#cast to wide
Panel.final <- dcast( Panel.short.joined,
                      panelID + country ~ Event_Type,
                      fun.aggregate = length )
# panelID country A C
# 1: 1 NLD 1 1
#perform update join on the original Panel
Panel[, `:=`(A=0, B=0, C=0, D=0)][
  Panel.final,
  `:=`( A = i.A, C = i.C), # <- add B = i.B and D = i.D here
  on = .( panelID )][]
# panelID country year A B C D
# 1: 1 NLD 2002 1 0 1 0
# 2: 1 NLD 2006 1 0 1 0
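The last step above lists the event columns by hand. As a rough sketch (not part of the original answer), the same idea can be written so that the column names come from Events itself. The toy Panel below holds only the two example observations from the question, the names spans, hits, wide, cols and found are made up for illustration, and nomatch = NULL assumes a reasonably recent data.table version.

```r
library(data.table)

# Toy data: the two example observations from the question (assumption)
Panel <- data.table(panelID = c(1L, 1L),
                    country = c("NLD", "NLD"),
                    year    = c(2002L, 2006L))
Events <- data.table(Event_Type = c("A", "C", "A", "B", "A", "D"),
                     country    = c("NLD", "NLD", "GBR", "GBR", "GRC", "GRC"),
                     year       = c(2005L, 2004L, 2006L, 2003L, 2002L, 2007L))

# time span covered by each panelID
spans <- Panel[, .(start = min(year), end = max(year)), by = .(panelID, country)]

# inner non-equi join: keep only events that fall inside a span
hits <- Events[spans, .(panelID, Event_Type),
               on = .(country, year >= start, year <= end),
               nomatch = NULL]

# count events per panelID and type, one column per type
wide <- dcast(hits, panelID ~ Event_Type,
              fun.aggregate = length, value.var = "Event_Type")

cols  <- sort(unique(Events$Event_Type))  # every event type gets a column
found <- intersect(cols, names(wide))     # types that actually occurred in a span

# initialise all event columns with 0, then update from the wide counts
Panel[, (cols) := 0L]
Panel[wide, on = .(panelID), (found) := mget(paste0("i.", found))]
Panel
```

Because the counts are aggregated per panelID before the update join, a span that contains several event years still ends up with one row of counts per panelID.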
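This is similar to @Wimpel's answer, but changes the order to:
- cast Events to wide
- update Panel by reference with the year range of each panelID
- non-equi update join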
# cast Event
Events_cast <- dcast(Events, country + year~Event_Type, length)
# update by reference for join later
Panel[, `:=`(start = min(year), end = max(year)), by = panelID]
# dcast sorts the rhs alphabetically
cols <- sort(unique(Events[['Event_Type']]))
# non-equi update join
Panel[Events_cast,
      on = .(country,
             start <= year,
             end >= year),
      (cols) := mget(cols)]
#clean up data frame
setnafill(Panel, fill = 0L, cols = cols)
Panel[, `:=`(start = NULL, end = NULL)]
Panel
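I would consider using between() and .SD. I could not follow your example, so in general terms:
# between(x, lower, upper): year is the column being tested
DT[between(year, startYear, endYear, incbounds = FALSE)][, dcast(.SD, cat1 ~ cat2 ...)]
Note: by passing the data.table to dcast() via .SD, you can subset further using i.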
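A minimal sketch of that pattern on the question's Events, assuming a single span for country NLD running from 2002 to 2006 (the values of startYear and endYear and the name res are illustrative). It uses incbounds = TRUE, because the question asks for years equal to or between the two observations.

```r
library(data.table)

Events <- data.table(Event_Type = c("A", "C", "A", "B", "A", "D"),
                     country    = c("NLD", "NLD", "GBR", "GBR", "GRC", "GRC"),
                     year       = c(2005L, 2004L, 2006L, 2003L, 2002L, 2007L))

# assumed span of one panelID (illustrative values)
startYear <- 2002L
endYear   <- 2006L

# subset in i with between(), then cast the remaining rows via .SD
res <- Events[country == "NLD" &
                between(year, startYear, endYear, incbounds = TRUE)][
  , dcast(.SD, country ~ Event_Type,
          fun.aggregate = length, value.var = "Event_Type")]
res
```

This handles one span at a time; for many panelIDs, a non-equi join as in the other answers avoids looping over spans.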