以分组观察中的重复为条件的虚拟变量
Dummy variable conditioned on repetitions in grouped observations
编辑
感谢您的回复。但是,我仍然没有设法解决我的问题,因为我的数据集包含 700,000 个观察结果,并且下面的所有方法都会导致错误,或者只是继续 运行 几个小时而没有完成(我可以告诉 Rstudio R Session 运行 占用了我的大量 RAM,但它根本无处可去。
如您所想,将数据集拆分成更小的部分不是一种选择,因为这会违背练习的目的:我需要查看之前的每一个观察结果才能得到想要的结果。
有什么想法吗?我暂时不回答这个问题,但如果你们认为我应该 post 一个新问题,我会的(老实说,我不知道关于这些事情的礼节,所以请随意提出建议)。
原版post
如标题所示,我正在寻找一个虚拟变量,它以分组观察中的重复为条件。
考虑以下数据框:
id name year
1 c af 2000
2 c el 2000
3 c in 2000
4 c ud 2000
5 d ot 2000
6 d an 2000
7 d el 2000
8 d un 2000
9 f yt 2002
10 f ip 2002
11 f ot 2002
12 f el 2002
13 g yt 2003
14 g af 2003
15 g ol 2003
16 g in 2003
17 h in 2003
18 h eg 2003
19 h yt 2003
20 h af 2003
21 j ot 2004
22 j el 2004
23 j ip 2004
24 j yt 2004
我正在寻找一个函数,它允许我按 id 对数据进行分组,如果 id 至少包含 ,return 值“1”三个名字出现在以前的 id 中。通过以前的 id,我的意思是以前的 id 的年份 has 小于当前的 id。
所需的输出应如下所示:
id name year dummy
1 c af 2000 0
2 c el 2000 0
3 c in 2000 0
4 c ud 2000 0
5 d ot 2000 0
6 d an 2000 0
7 d el 2000 0
8 d un 2000 0
9 f yt 2002 0
10 f ip 2002 0
11 f ot 2002 0
12 f el 2002 0
13 g yt 2003 0
14 g af 2003 0
15 g ol 2003 0
16 g in 2003 0
17 h in 2003 0
18 h eg 2003 0
19 h yt 2003 0
20 h af 2003 0
21 j ot 2004 1
22 j el 2004 1
23 j ip 2004 1
24 j yt 2004 1
id = "j" 取值 dummy = "1",因为至少有三个名称,"yt"、"ip" 和 "ot",出现在 id = "f"。在这种情况下,还有第四个名字也出现了,"el",但这并不影响结果。
请注意,id = "h" 的值为 dummy = "0",即使三个名字也出现在 id = "g" 中。这是因为两次都发生在2003年,不满足年份分开的条件。
数据:
DF = structure(list(id = c("c", "c", "c", "c", "d", "d", "d", "d",
"f", "f", "f", "f", "g", "g", "g", "g", "h", "h", "h", "h", "j",
"j", "j", "j"), name = c("af", "el", "in", "ud", "ot", "an",
"el", "un", "yt", "ip", "ot", "el", "yt", "af", "ol", "in", "in",
"eg", "yt", "af", "ot", "el", "ip", "yt"), year = c(2000L, 2000L,
2000L, 2000L, 2000L, 2000L, 2000L, 2000L, 2002L, 2002L, 2002L,
2002L, 2003L, 2003L, 2003L, 2003L, 2003L, 2003L, 2003L, 2003L,
2004L, 2004L, 2004L, 2004L), dummy = c(0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L,
1L, 1L)), .Names = c("id", "name", "year", "dummy"), row.names = c(NA,
-24L), class = "data.frame")
这是我的解决方案,使用 dplyr 和 tidyr 以及一个函数来识别具有 3 个或更多匹配名称的 ID:
library(dplyr)
library(tidyr)
test <- function(x){
out2 <- sapply(1:length(x), function(j){
out <- sapply(1:j, function(i){
sum(x[[j]] %in% x[[i]])
})
out[j]<-NA
which(out >= 3) %>% min() %>% {ifelse(is.infinite(.),NA,.)}
})
out2
}
DF2 <- DF %>% group_by(id, year) %>%
summarise(names = list(name)) %>% ungroup() %>%
mutate(dummy2 = test(names)) %>%
mutate(year_mch = year[dummy2],
dummy = year_mch < year) %>%
unnest()
DF2
由于无限值,它给出了一堆警告,但这并不影响结果。
基于 R 的一种方法:
n <- split(DF$name, DF$id)
m1 <- sapply(n, function(s1) sapply(n, function(s2) sum(s1 %in% s2) ))
diag(m1) <- 0
m1[upper.tri(m1)] <- 0
r1 <- rownames(m1)[!!rowSums(m1 > 2)]
y <- sapply(split(DF$year, DF$id), unique)
m2 <- sapply(y, function(s1) sapply(y, function(s2) +(s1 == s2) ))
diag(m2) <- 0
m2[upper.tri(m2)] <- 0
r2 <- rownames(m2)[!rowSums(m2)]
DF$dummy2 <- as.integer(DF$id %in% intersect(r1,r2))
给出:
> DF
id name year dummy dummy2
1 c af 2000 0 0
2 c el 2000 0 0
3 c in 2000 0 0
4 c ud 2000 0 0
5 d ot 2000 0 0
6 d an 2000 0 0
7 d el 2000 0 0
8 d un 2000 0 0
9 f yt 2002 0 0
10 f ip 2002 0 0
11 f ot 2002 0 0
12 f el 2002 0 0
13 g yt 2003 0 0
14 g af 2003 0 0
15 g ol 2003 0 0
16 g in 2003 0 0
17 h in 2003 0 0
18 h eg 2003 0 0
19 h yt 2003 0 0
20 h af 2003 0 0
21 j ot 2004 1 1
22 j el 2004 1 1
23 j ip 2004 1 1
24 j yt 2004 1 1
与 Jaap 和 see24 类似,但使用 length(intersect(x,y))
而不是 ==
/%in%
和 rowSums
/sum
:
library(data.table)
setDT(DF)
idDT = unique(DF[, .(id, year)])
setkey(idDT, id)
s = split(DF$name, DF$id)
# identify pairs of ids, where id1 appears before id2 in the table
pairsDT = idDT[, CJ(id1 = id, id2 = id)[id1 < id2]]
# record whether it's strictly before
pairsDT[, earlier := idDT[id1, x.year] < idDT[id2, x.year]]
# if it's strictly before, compare number of matching elements
pairsDT[earlier == TRUE, matched :=
mapply(function(x, y) length(intersect(x, y)), s[id1], s[id2]) >= 3
]
dum_ids = pairsDT[matched == TRUE, unique(id2)]
然后您可以在 idDT 或 DF 中记录标准:
idDT[, dum := id %in% dum_ids]
DF[, dum := id %in% dum_ids]
在基础 R 中,可以使用 combn
完成类似的事情。我想与仅将数据存储在图形中(例如,使用 igraph 包)并从那里开始工作相比,这仍然是非常低效的。
我会找任何借口将数据问题转换为图形问题,所以为提出这个问题的弗兰克干杯。这是一个 igraph
解决方案。本质上,它将数据转换为有向树。所有节点仅与层次结构中更高的节点进行比较。所以 C 是树的顶部,不与其他任何东西进行比较,而 J 是终端,并与它上面的所有节点进行比较连锁,链条。要拉出层次结构中较高的所有节点,您需要做的就是使用(深度优先搜索)dfs
函数
library(tidyverse)
library(igraph)
#node list containing data specific to the group
nodelist <- DF %>%
group_by(id, year) %>%
nest()
#edge list containing connections. A group directly before a node points toward a future group
edgelist <- data.frame(
from = nodelist$id %>% .[1:(length(.)-1)],
to = nodelist$id %>% .[2:length(.)]
)
#create the data frame
g <- graph_from_data_frame(edgelist, T, nodelist)
#let's iterate through the nodes
dummy <- map_lgl(V(g)$name, function(vertex){
#depth first search to pull out all nodes higher up on the tree
full_path <- dfs(g, vertex, 'in', unreachable = F) %>%
.$order %>%
.[!is.na(.)]
#if there is no node higher up, then we're done
if(length(full_path) <= 1) return(F)
#The first node returned is the node we're iterating over
this_vertex <- full_path[1]
other_vertices <- full_path[full_path != this_vertex]
#this is the logic for the dummy variable
similar_groups <- map_lgl(other_vertices, function(other_vertex){
(sum(this_vertex$data[[1]]$name %in% other_vertex$data$name) >= 3) &
(this_vertex$year[[1]] != other_vertex$year)
})
return(T %in% similar_groups)
})
V(g)$dummy2 <- dummy
as_data_frame(g, 'vertices') %>%
unnest()
name year dummy2 name1 dummy
1 c 2000 FALSE af 0
2 c 2000 FALSE el 0
3 c 2000 FALSE in 0
4 c 2000 FALSE ud 0
5 d 2000 FALSE ot 0
6 d 2000 FALSE an 0
7 d 2000 FALSE el 0
8 d 2000 FALSE un 0
9 f 2002 FALSE yt 0
10 f 2002 FALSE ip 0
11 f 2002 FALSE ot 0
12 f 2002 FALSE el 0
13 g 2003 FALSE yt 0
14 g 2003 FALSE af 0
15 g 2003 FALSE ol 0
16 g 2003 FALSE in 0
17 h 2003 FALSE in 0
18 h 2003 FALSE eg 0
19 h 2003 FALSE yt 0
20 h 2003 FALSE af 0
21 j 2004 TRUE ot 1
22 j 2004 TRUE el 1
23 j 2004 TRUE ip 1
24 j 2004 TRUE yt 1
所以这个解决方案是纯粹的基础 R。我曾经读过一篇文章,声称使用 . <-
是 %>%
的有效替代品。这是我第一次尝试。我觉得我喜欢
. <- DF[c('id', 'name', 'year')]
. <- merge(., ., by = 'name')
. <- .[.["id.x"] != .["id.y"] & .["year.x"] < .["year.y"],]
. <- .[c('id.x', 'id.y', 'year.x', 'year.y', "name")]
.$n <- 1
. <- aggregate(n ~ id.x + id.y, data = ., sum)
. <- .[.['n'] >= 3, 'id.y']
DF$dummy2 <- . == DF$id
在 OP 对速度和内存问题进行编辑后,Rcpp
方法如何:
#create a integer column out of id for non-equi join later
setDT(DF)[, nid := rleid(id)]
#convert name into an integer code
DF[DF[,.(name=unique(name))][, IntCode := .I], iname := IntCode, on=.(name)]
library(inline)
library(Rcpp)
cppFunction('
NumericVector hasOccur(NumericVector nid, NumericVector year, List iname) {
List namelist(iname);
int sz = namelist.size(), i, j, m, n, nPrev, nCurr, count;
NumericVector res(sz);
for(i=0; i<sz; i++) {
for(j=0; j<i; j++) {
if (nid[j] < nid[i] && year[j] < year[i]) {
SEXP prevList = namelist[j];
SEXP currList = namelist[i];
NumericVector cl(currList);
NumericVector pl(prevList);
nPrev = pl.size();
nCurr = cl.size();
res[i] = 0;
count = 0;
for(m=0; m<nCurr; m++) {
for (n=0; n<nPrev; n++) {
if (cl[m] == pl[n]) {
count++;
break;
}
}
}
if (count >= 3) {
res[i] = 1;
break;
}
}
}
}
return(res);
}')
d <- DF[, .(.(nm=iname)), by=.(nid, year)]
DF[d[, dummy := hasOccur(d$nid, d$year, d$V1)], dummy := dummy, on=.(nid, year)]
HTH.
另一种可能的data.table
方法:
#create a integer column out of id for non-equi join later
setDT(DF)[, nid := rleid(id)]
#self non-equi join
check3 <- DF[DF, .(x.id, x.name, x.year, x.nid, i.id, i.name, i.year, i.nid), on=.(nid<nid, year<year, name=name)][,
#count the number of occurrence in previous id and year
uniqueN(x.name, na.rm=TRUE), by=.(i.id, i.year, x.id, x.year)][,
#check if more than 3
any(V1 >= 3L), by=.(i.id, i.year)]
#update join to add result to original DF
DF[check3, dummy := as.integer(V1), on=c("id"="i.id", "year"="i.year")]
编辑
感谢您的回复。但是,我仍然没有设法解决我的问题,因为我的数据集包含 700,000 个观察结果,并且下面的所有方法都会导致错误,或者只是继续 运行 几个小时而没有完成(我可以告诉 Rstudio R Session 运行 占用了我的大量 RAM,但它根本无处可去。
如您所想,将数据集拆分成更小的部分不是一种选择,因为这会违背练习的目的:我需要查看之前的每一个观察结果才能得到想要的结果。
有什么想法吗?我暂时不回答这个问题,但如果你们认为我应该 post 一个新问题,我会的(老实说,我不知道关于这些事情的礼节,所以请随意提出建议)。
原版post
如标题所示,我正在寻找一个虚拟变量,它以分组观察中的重复为条件。
考虑以下数据框:
id name year
1 c af 2000
2 c el 2000
3 c in 2000
4 c ud 2000
5 d ot 2000
6 d an 2000
7 d el 2000
8 d un 2000
9 f yt 2002
10 f ip 2002
11 f ot 2002
12 f el 2002
13 g yt 2003
14 g af 2003
15 g ol 2003
16 g in 2003
17 h in 2003
18 h eg 2003
19 h yt 2003
20 h af 2003
21 j ot 2004
22 j el 2004
23 j ip 2004
24 j yt 2004
我正在寻找一个函数,它允许我按 id 对数据进行分组,如果 id 至少包含 ,return 值“1”三个名字出现在以前的 id 中。通过以前的 id,我的意思是以前的 id 的年份 has 小于当前的 id。
所需的输出应如下所示:
id name year dummy
1 c af 2000 0
2 c el 2000 0
3 c in 2000 0
4 c ud 2000 0
5 d ot 2000 0
6 d an 2000 0
7 d el 2000 0
8 d un 2000 0
9 f yt 2002 0
10 f ip 2002 0
11 f ot 2002 0
12 f el 2002 0
13 g yt 2003 0
14 g af 2003 0
15 g ol 2003 0
16 g in 2003 0
17 h in 2003 0
18 h eg 2003 0
19 h yt 2003 0
20 h af 2003 0
21 j ot 2004 1
22 j el 2004 1
23 j ip 2004 1
24 j yt 2004 1
id = "j" 取值 dummy = "1",因为至少有三个名称,"yt"、"ip" 和 "ot",出现在 id = "f"。在这种情况下,还有第四个名字也出现了,"el",但这并不影响结果。
请注意,id = "h" 的值为 dummy = "0",即使三个名字也出现在 id = "g" 中。这是因为两次都发生在2003年,不满足年份分开的条件。
数据:
DF = structure(list(id = c("c", "c", "c", "c", "d", "d", "d", "d",
"f", "f", "f", "f", "g", "g", "g", "g", "h", "h", "h", "h", "j",
"j", "j", "j"), name = c("af", "el", "in", "ud", "ot", "an",
"el", "un", "yt", "ip", "ot", "el", "yt", "af", "ol", "in", "in",
"eg", "yt", "af", "ot", "el", "ip", "yt"), year = c(2000L, 2000L,
2000L, 2000L, 2000L, 2000L, 2000L, 2000L, 2002L, 2002L, 2002L,
2002L, 2003L, 2003L, 2003L, 2003L, 2003L, 2003L, 2003L, 2003L,
2004L, 2004L, 2004L, 2004L), dummy = c(0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L,
1L, 1L)), .Names = c("id", "name", "year", "dummy"), row.names = c(NA,
-24L), class = "data.frame")
这是我的解决方案,使用 dplyr 和 tidyr 以及一个函数来识别具有 3 个或更多匹配名称的 ID:
library(dplyr)
library(tidyr)
test <- function(x){
out2 <- sapply(1:length(x), function(j){
out <- sapply(1:j, function(i){
sum(x[[j]] %in% x[[i]])
})
out[j]<-NA
which(out >= 3) %>% min() %>% {ifelse(is.infinite(.),NA,.)}
})
out2
}
DF2 <- DF %>% group_by(id, year) %>%
summarise(names = list(name)) %>% ungroup() %>%
mutate(dummy2 = test(names)) %>%
mutate(year_mch = year[dummy2],
dummy = year_mch < year) %>%
unnest()
DF2
由于无限值,它给出了一堆警告,但这并不影响结果。
基于 R 的一种方法:
n <- split(DF$name, DF$id)
m1 <- sapply(n, function(s1) sapply(n, function(s2) sum(s1 %in% s2) ))
diag(m1) <- 0
m1[upper.tri(m1)] <- 0
r1 <- rownames(m1)[!!rowSums(m1 > 2)]
y <- sapply(split(DF$year, DF$id), unique)
m2 <- sapply(y, function(s1) sapply(y, function(s2) +(s1 == s2) ))
diag(m2) <- 0
m2[upper.tri(m2)] <- 0
r2 <- rownames(m2)[!rowSums(m2)]
DF$dummy2 <- as.integer(DF$id %in% intersect(r1,r2))
给出:
> DF id name year dummy dummy2 1 c af 2000 0 0 2 c el 2000 0 0 3 c in 2000 0 0 4 c ud 2000 0 0 5 d ot 2000 0 0 6 d an 2000 0 0 7 d el 2000 0 0 8 d un 2000 0 0 9 f yt 2002 0 0 10 f ip 2002 0 0 11 f ot 2002 0 0 12 f el 2002 0 0 13 g yt 2003 0 0 14 g af 2003 0 0 15 g ol 2003 0 0 16 g in 2003 0 0 17 h in 2003 0 0 18 h eg 2003 0 0 19 h yt 2003 0 0 20 h af 2003 0 0 21 j ot 2004 1 1 22 j el 2004 1 1 23 j ip 2004 1 1 24 j yt 2004 1 1
与 Jaap 和 see24 类似,但使用 length(intersect(x,y))
而不是 ==
/%in%
和 rowSums
/sum
:
library(data.table)
setDT(DF)
idDT = unique(DF[, .(id, year)])
setkey(idDT, id)
s = split(DF$name, DF$id)
# identify pairs of ids, where id1 appears before id2 in the table
pairsDT = idDT[, CJ(id1 = id, id2 = id)[id1 < id2]]
# record whether it's strictly before
pairsDT[, earlier := idDT[id1, x.year] < idDT[id2, x.year]]
# if it's strictly before, compare number of matching elements
pairsDT[earlier == TRUE, matched :=
mapply(function(x, y) length(intersect(x, y)), s[id1], s[id2]) >= 3
]
dum_ids = pairsDT[matched == TRUE, unique(id2)]
然后您可以在 idDT 或 DF 中记录标准:
idDT[, dum := id %in% dum_ids]
DF[, dum := id %in% dum_ids]
在基础 R 中,可以使用 combn
完成类似的事情。我想与仅将数据存储在图形中(例如,使用 igraph 包)并从那里开始工作相比,这仍然是非常低效的。
我会找任何借口将数据问题转换为图形问题,所以为提出这个问题的弗兰克干杯。这是一个 igraph
解决方案。本质上,它将数据转换为有向树。所有节点仅与层次结构中更高的节点进行比较。所以 C 是树的顶部,不与其他任何东西进行比较,而 J 是终端,并与它上面的所有节点进行比较连锁,链条。要拉出层次结构中较高的所有节点,您需要做的就是使用(深度优先搜索)dfs
函数
library(tidyverse)
library(igraph)
#node list containing data specific to the group
nodelist <- DF %>%
group_by(id, year) %>%
nest()
#edge list containing connections. A group directly before a node points toward a future group
edgelist <- data.frame(
from = nodelist$id %>% .[1:(length(.)-1)],
to = nodelist$id %>% .[2:length(.)]
)
#create the data frame
g <- graph_from_data_frame(edgelist, T, nodelist)
#let's iterate through the nodes
dummy <- map_lgl(V(g)$name, function(vertex){
#depth first search to pull out all nodes higher up on the tree
full_path <- dfs(g, vertex, 'in', unreachable = F) %>%
.$order %>%
.[!is.na(.)]
#if there is no node higher up, then we're done
if(length(full_path) <= 1) return(F)
#The first node returned is the node we're iterating over
this_vertex <- full_path[1]
other_vertices <- full_path[full_path != this_vertex]
#this is the logic for the dummy variable
similar_groups <- map_lgl(other_vertices, function(other_vertex){
(sum(this_vertex$data[[1]]$name %in% other_vertex$data$name) >= 3) &
(this_vertex$year[[1]] != other_vertex$year)
})
return(T %in% similar_groups)
})
V(g)$dummy2 <- dummy
as_data_frame(g, 'vertices') %>%
unnest()
name year dummy2 name1 dummy
1 c 2000 FALSE af 0
2 c 2000 FALSE el 0
3 c 2000 FALSE in 0
4 c 2000 FALSE ud 0
5 d 2000 FALSE ot 0
6 d 2000 FALSE an 0
7 d 2000 FALSE el 0
8 d 2000 FALSE un 0
9 f 2002 FALSE yt 0
10 f 2002 FALSE ip 0
11 f 2002 FALSE ot 0
12 f 2002 FALSE el 0
13 g 2003 FALSE yt 0
14 g 2003 FALSE af 0
15 g 2003 FALSE ol 0
16 g 2003 FALSE in 0
17 h 2003 FALSE in 0
18 h 2003 FALSE eg 0
19 h 2003 FALSE yt 0
20 h 2003 FALSE af 0
21 j 2004 TRUE ot 1
22 j 2004 TRUE el 1
23 j 2004 TRUE ip 1
24 j 2004 TRUE yt 1
所以这个解决方案是纯粹的基础 R。我曾经读过一篇文章,声称使用 . <-
是 %>%
的有效替代品。这是我第一次尝试。我觉得我喜欢
. <- DF[c('id', 'name', 'year')]
. <- merge(., ., by = 'name')
. <- .[.["id.x"] != .["id.y"] & .["year.x"] < .["year.y"],]
. <- .[c('id.x', 'id.y', 'year.x', 'year.y', "name")]
.$n <- 1
. <- aggregate(n ~ id.x + id.y, data = ., sum)
. <- .[.['n'] >= 3, 'id.y']
DF$dummy2 <- . == DF$id
在 OP 对速度和内存问题进行编辑后,Rcpp
方法如何:
#create a integer column out of id for non-equi join later
setDT(DF)[, nid := rleid(id)]
#convert name into an integer code
DF[DF[,.(name=unique(name))][, IntCode := .I], iname := IntCode, on=.(name)]
library(inline)
library(Rcpp)
cppFunction('
NumericVector hasOccur(NumericVector nid, NumericVector year, List iname) {
List namelist(iname);
int sz = namelist.size(), i, j, m, n, nPrev, nCurr, count;
NumericVector res(sz);
for(i=0; i<sz; i++) {
for(j=0; j<i; j++) {
if (nid[j] < nid[i] && year[j] < year[i]) {
SEXP prevList = namelist[j];
SEXP currList = namelist[i];
NumericVector cl(currList);
NumericVector pl(prevList);
nPrev = pl.size();
nCurr = cl.size();
res[i] = 0;
count = 0;
for(m=0; m<nCurr; m++) {
for (n=0; n<nPrev; n++) {
if (cl[m] == pl[n]) {
count++;
break;
}
}
}
if (count >= 3) {
res[i] = 1;
break;
}
}
}
}
return(res);
}')
d <- DF[, .(.(nm=iname)), by=.(nid, year)]
DF[d[, dummy := hasOccur(d$nid, d$year, d$V1)], dummy := dummy, on=.(nid, year)]
HTH.
另一种可能的data.table
方法:
#create a integer column out of id for non-equi join later
setDT(DF)[, nid := rleid(id)]
#self non-equi join
check3 <- DF[DF, .(x.id, x.name, x.year, x.nid, i.id, i.name, i.year, i.nid), on=.(nid<nid, year<year, name=name)][,
#count the number of occurrence in previous id and year
uniqueN(x.name, na.rm=TRUE), by=.(i.id, i.year, x.id, x.year)][,
#check if more than 3
any(V1 >= 3L), by=.(i.id, i.year)]
#update join to add result to original DF
DF[check3, dummy := as.integer(V1), on=c("id"="i.id", "year"="i.year")]