Dcast/merge based on a column, with values within a certain range
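I have a panel dataset, Panel, and a dataset containing a list of events, Events. In the panel dataset, rows with the same panelID are observations of the same unit and belong together.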
panelID = c(1:50)
year= c(2001:2010)
country = c("NLD", "GRC", "GBR")
n <- 2
library(data.table)
set.seed(123)
Panel <- data.table(panelID = rep(sample(panelID), each = n),
country = rep(sample(country, length(panelID), replace = T), each = n),
year = c(replicate(length(panelID), sample(year, n))),
some_NA = sample(0:5, 6),
some_NA_factor = sample(0:5, 6),
norm = round(runif(100)/10,2),
Income = round(rnorm(10,-5,5),2),
Happiness = sample(10,10),
Sex = round(rnorm(10,0.75,0.3),2),
Age = sample(100,100),
Educ = round(rnorm(10,0.75,0.3),2))
Panel[, uniqueID := .I] # Creates a unique ID
Panel[Panel == 0] <- NA
Events <- fread(
"Event_Type country year
A NLD 2005
A NLD 2004
A GBR 2006
A GBR 2003
A GRC 2002
A GRC 2007",
header = TRUE)
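I want to know how often Events occur between the panel observations, with a separate column for each event year. For example, for the panel observations with panelID == 2, there are two events in country NLD that fall within or between the years of the panel observations, namely 2004 and 2005. Therefore:
Desired output: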
panelID country year 2002 2003 2004 2005 2006 2007
2       NLD     2004 NA   NA   1    1    NA   NA
2       NLD     2007 NA   NA   1    1    NA   NA
Based on an earlier solution, I tried the following:
# cast Event
Events_cast <- reshape2::dcast(Events, country + year ~ year, length, value.var="year")
# update by reference for join later
Panel[, `:=`(start = min(year), end = max(year)), by = panelID]
# dcast sorts the rhs alphabetically
cols <- sort(unique(Events[['year']]))
# non-equi update join
Panel[Events_cast,
on = .(country,
start <= year,
end >= year),
(cols) := mget(cols)]
#clean up data frame
setnafill(Panel, fill = 0L, cols = cols)
Panel[, `:=`(start = NULL, end = NULL)]
Panel
However, at the # non-equi update join step I get this error:
Error in [.data.table (Panel, Events, on = .(country, : LHS of := appears to be column positions but are outside [1,ncol] range. New columns can only be added by name.
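cols holds the years as integers, so data.table reads them as column positions rather than column names. The error is telling you that 2006 and the other years are not valid column numbers. The fix is simple: convert the years to character so they are treated as names: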
cols <- as.character(sort(unique(Events[['year']])))
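To see the difference in isolation, here is a minimal sketch on a toy table. The names dt, int_cols and chr_cols are made up for illustration and are not part of the question's data. It only shows how := treats an integer left-hand side versus a character one.

```r
library(data.table)

# Toy table, unrelated to Panel or Events (hypothetical names).
dt <- data.table(id = 1:3, value = c(10, 20, 30))

int_cols <- c(2004L, 2005L)         # integers: read as column positions
chr_cols <- as.character(int_cols)  # characters: read as column names

# The integer LHS fails because dt has no columns at positions 2004 and 2005.
# tryCatch keeps the script running and stores the error message instead.
res <- tryCatch(
  dt[, (int_cols) := list(1L, 1L)],
  error = function(e) conditionMessage(e)
)
print(res)

# The character LHS adds two new columns named "2004" and "2005".
dt[, (chr_cols) := list(1L, 1L)]
print(names(dt))
```

The same rule applies inside a join: whatever is wrapped in parentheses on the left of := needs to be a character vector if the columns do not exist yet.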
Here is everything together, along with a few other changes:
- Using data.table::dcast instead of reshape2::dcast
- Adding start and end to the Events data.table and casting on those columns.
# cast Event
# Events_cast <- reshape2::dcast(Events, country + year ~ year, length, value.var="year")
Events[, `:=`(start = min(year), end = max(year)), by = country]
Events_cast <- dcast(Events, country + start + end~ year, length)
# update by reference for join later
Panel[, `:=`(start = min(year), end = max(year)), by = panelID]
# dcast sorts the rhs alphabetically
cols <- as.character(sort(unique(Events[['year']])))
# non-equi update join
# Panel[Events_cast,
# on = .(country,
# start <= year,
# end >= year),
# (cols) := mget(cols)]
Panel[Events_cast,
on = .(country,
start <= start,
end >= end),
(cols) := mget(cols)]
#clean up data frame
setnafill(Panel, fill = 0L, cols = cols)
Panel[, `:=`(start = NULL, end = NULL)]
Panel
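One thing to keep in mind: the join above compares each panel unit's start and end with the first and last event of the whole country, so a unit is only matched when all of that country's events fall inside its window. If events should instead be counted one by one, something along these lines may be closer to the desired output. This is a sketch on toy data, not a tested drop-in replacement. The tables panel and events, the helper tables spans, hits and counts, and the hit column are hypothetical names introduced here.

```r
library(data.table)

# Toy data (hypothetical): three panel units in one country, two events.
panel <- data.table(
  panelID = c(1L, 1L, 2L, 2L, 3L, 3L),
  country = "NLD",
  year    = c(2004L, 2007L, 2005L, 2008L, 2001L, 2003L)
)
events <- data.table(country = "NLD", year = c(2004L, 2005L))

# One row per panel unit with its observation window.
spans <- panel[, .(start = min(year), end = max(year)), by = .(panelID, country)]

# Non-equi join: for every single event, find the units whose window contains it.
hits <- spans[events,
              on = .(country, start <= year, end >= year),
              .(panelID = x.panelID, event_year = i.year, hit = 1L),
              nomatch = NULL]

# One column per event year, counting the matched events per unit.
counts <- dcast(hits, panelID ~ event_year, fun.aggregate = length, value.var = "hit")

# Attach the counts to every panel row; units without any event get NA first.
result <- merge(panel, counts, by = "panelID", all.x = TRUE)
year_cols <- setdiff(names(counts), "panelID")
setnafill(result, fill = 0L, cols = year_cols)
print(result)
```

In this toy example unit 2 (2005 to 2008) still picks up the 2005 event even though the 2004 event lies outside its window, which the span-based join would not report. Skip the setnafill step if unmatched cells should stay NA, as in the desired output.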