以分组观察中的重复为条件的虚拟变量

Question

编辑

感谢您的回复。但是，我仍然没有设法解决我的问题，因为我的数据集包含 700,000 个观察结果，并且下面的所有方法都会导致错误，或者只是继续运行几个小时而没有完成（我可以告诉 Rstudio R Session 运行占用了我的大量 RAM，但它根本无处可去。

如您所想，将数据集拆分成更小的部分不是一种选择，因为这会违背练习的目的：我需要查看之前的每一个观察结果才能得到想要的结果。

有什么想法吗？我暂时不回答这个问题，但如果你们认为我应该 post 一个新问题，我会的（老实说，我不知道关于这些事情的礼节，所以请随意提出建议）。

原版post

如标题所示，我正在寻找一个虚拟变量，它以分组观察中的重复为条件。

考虑以下数据框：

   id name year
1   c   af 2000
2   c   el 2000
3   c   in 2000
4   c   ud 2000
5   d   ot 2000
6   d   an 2000
7   d   el 2000
8   d   un 2000
9   f   yt 2002
10  f   ip 2002
11  f   ot 2002
12  f   el 2002
13  g   yt 2003
14  g   af 2003
15  g   ol 2003
16  g   in 2003
17  h   in 2003
18  h   eg 2003
19  h   yt 2003
20  h   af 2003
21  j   ot 2004
22  j   el 2004
23  j   ip 2004
24  j   yt 2004

我正在寻找一个函数，它允许我按 id 对数据进行分组，如果 id 至少包含，return 值“1”三个名字出现在以前的 id 中。通过以前的 id，我的意思是以前的 id 的年份 has 小于当前的 id。

所需的输出应如下所示：

   id name year dummy
1   c   af 2000     0
2   c   el 2000     0
3   c   in 2000     0
4   c   ud 2000     0
5   d   ot 2000     0
6   d   an 2000     0
7   d   el 2000     0
8   d   un 2000     0
9   f   yt 2002     0
10  f   ip 2002     0
11  f   ot 2002     0
12  f   el 2002     0
13  g   yt 2003     0
14  g   af 2003     0
15  g   ol 2003     0
16  g   in 2003     0
17  h   in 2003     0
18  h   eg 2003     0
19  h   yt 2003     0
20  h   af 2003     0
21  j   ot 2004     1
22  j   el 2004     1
23  j   ip 2004     1
24  j   yt 2004     1

id = "j" 取值 dummy = "1"，因为至少有三个名称，"yt"、"ip" 和 "ot"，出现在 id = "f"。在这种情况下，还有第四个名字也出现了，"el"，但这并不影响结果。

请注意，id = "h" 的值为 dummy = "0"，即使三个名字也出现在 id = "g" 中。这是因为两次都发生在2003年，不满足年份分开的条件。

数据：

DF = structure(list(id = c("c", "c", "c", "c", "d", "d", "d", "d", 
"f", "f", "f", "f", "g", "g", "g", "g", "h", "h", "h", "h", "j", 
"j", "j", "j"), name = c("af", "el", "in", "ud", "ot", "an", 
"el", "un", "yt", "ip", "ot", "el", "yt", "af", "ol", "in", "in", 
"eg", "yt", "af", "ot", "el", "ip", "yt"), year = c(2000L, 2000L, 
2000L, 2000L, 2000L, 2000L, 2000L, 2000L, 2002L, 2002L, 2002L, 
2002L, 2003L, 2003L, 2003L, 2003L, 2003L, 2003L, 2003L, 2003L, 
2004L, 2004L, 2004L, 2004L), dummy = c(0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 
1L, 1L)), .Names = c("id", "name", "year", "dummy"), row.names = c(NA, 
-24L), class = "data.frame")

Answer 1

这是我的解决方案，使用 dplyr 和 tidyr 以及一个函数来识别具有 3 个或更多匹配名称的 ID：

library(dplyr)
library(tidyr)

test <- function(x){
  out2 <- sapply(1:length(x), function(j){
    out <- sapply(1:j, function(i){
      sum(x[[j]] %in% x[[i]])
    })
    out[j]<-NA
    which(out >= 3) %>% min() %>% {ifelse(is.infinite(.),NA,.)}

  })
  out2
}

DF2 <-  DF %>% group_by(id, year) %>% 
  summarise(names = list(name)) %>% ungroup() %>% 
  mutate(dummy2 = test(names)) %>% 
  mutate(year_mch = year[dummy2], 
         dummy = year_mch < year) %>% 
  unnest() 
DF2

由于无限值，它给出了一堆警告，但这并不影响结果。

Answer 2

基于 R 的一种方法：

n <- split(DF$name, DF$id)
m1 <- sapply(n, function(s1) sapply(n, function(s2) sum(s1 %in% s2) ))
diag(m1) <- 0
m1[upper.tri(m1)] <- 0
r1 <- rownames(m1)[!!rowSums(m1 > 2)]

y <- sapply(split(DF$year, DF$id), unique)
m2 <- sapply(y, function(s1) sapply(y, function(s2) +(s1 == s2) ))
diag(m2) <- 0
m2[upper.tri(m2)] <- 0
r2 <- rownames(m2)[!rowSums(m2)]

DF$dummy2 <- as.integer(DF$id %in% intersect(r1,r2))

给出：

> DF
   id name year dummy dummy2
1   c   af 2000     0      0
2   c   el 2000     0      0
3   c   in 2000     0      0
4   c   ud 2000     0      0
5   d   ot 2000     0      0
6   d   an 2000     0      0
7   d   el 2000     0      0
8   d   un 2000     0      0
9   f   yt 2002     0      0
10  f   ip 2002     0      0
11  f   ot 2002     0      0
12  f   el 2002     0      0
13  g   yt 2003     0      0
14  g   af 2003     0      0
15  g   ol 2003     0      0
16  g   in 2003     0      0
17  h   in 2003     0      0
18  h   eg 2003     0      0
19  h   yt 2003     0      0
20  h   af 2003     0      0
21  j   ot 2004     1      1
22  j   el 2004     1      1
23  j   ip 2004     1      1
24  j   yt 2004     1      1

Answer 3

与 Jaap 和 see24 类似，但使用 length(intersect(x,y)) 而不是 ==/%in% 和 rowSums/sum:

library(data.table)
setDT(DF)
idDT = unique(DF[, .(id, year)])
setkey(idDT, id)

s = split(DF$name, DF$id)

# identify pairs of ids, where id1 appears before id2 in the table
pairsDT = idDT[, CJ(id1 = id, id2 = id)[id1 < id2]]

# record whether it's strictly before 
pairsDT[, earlier := idDT[id1, x.year] < idDT[id2, x.year]]

# if it's strictly before, compare number of matching elements
pairsDT[earlier == TRUE, matched := 
  mapply(function(x, y) length(intersect(x, y)), s[id1], s[id2]) >= 3
]

dum_ids = pairsDT[matched == TRUE, unique(id2)]

然后您可以在 idDT 或 DF 中记录标准：

idDT[, dum := id %in% dum_ids]
DF[, dum := id %in% dum_ids]

在基础 R 中，可以使用 combn 完成类似的事情。我想与仅将数据存储在图形中（例如，使用 igraph 包）并从那里开始工作相比，这仍然是非常低效的。

Answer 4

我会找任何借口将数据问题转换为图形问题，所以为提出这个问题的弗兰克干杯。这是一个 igraph 解决方案。本质上，它将数据转换为有向树。所有节点仅与层次结构中更高的节点进行比较。所以 C 是树的顶部，不与其他任何东西进行比较，而 J 是终端，并与它上面的所有节点进行比较连锁，链条。要拉出层次结构中较高的所有节点，您需要做的就是使用（深度优先搜索）dfs 函数

library(tidyverse)
library(igraph)

#node list containing data specific to the group
nodelist <- DF %>%
  group_by(id, year) %>%
  nest()

#edge list containing connections. A group directly before a node points toward a future group
edgelist <- data.frame(
  from = nodelist$id %>% .[1:(length(.)-1)],
  to = nodelist$id %>% .[2:length(.)]
)

#create the data frame
g <- graph_from_data_frame(edgelist, T, nodelist)

#let's iterate through the nodes
dummy <- map_lgl(V(g)$name, function(vertex){

  #depth first search to pull out all nodes higher up on the tree 
  full_path <- dfs(g, vertex, 'in', unreachable = F) %>%
    .$order %>% 
    .[!is.na(.)] 

  #if there is no node higher up, then we're done
  if(length(full_path) <= 1) return(F)

  #The first node returned is the node we're iterating over
  this_vertex <- full_path[1]
  other_vertices <- full_path[full_path != this_vertex]

  #this is the logic for the dummy variable
  similar_groups <- map_lgl(other_vertices, function(other_vertex){
    (sum(this_vertex$data[[1]]$name %in% other_vertex$data$name) >= 3) & 
      (this_vertex$year[[1]] != other_vertex$year)
  })

  return(T %in% similar_groups)
})

V(g)$dummy2 <- dummy

as_data_frame(g, 'vertices') %>%
  unnest()

   name year dummy2 name1 dummy
1     c 2000  FALSE    af     0
2     c 2000  FALSE    el     0
3     c 2000  FALSE    in     0
4     c 2000  FALSE    ud     0
5     d 2000  FALSE    ot     0
6     d 2000  FALSE    an     0
7     d 2000  FALSE    el     0
8     d 2000  FALSE    un     0
9     f 2002  FALSE    yt     0
10    f 2002  FALSE    ip     0
11    f 2002  FALSE    ot     0
12    f 2002  FALSE    el     0
13    g 2003  FALSE    yt     0
14    g 2003  FALSE    af     0
15    g 2003  FALSE    ol     0
16    g 2003  FALSE    in     0
17    h 2003  FALSE    in     0
18    h 2003  FALSE    eg     0
19    h 2003  FALSE    yt     0
20    h 2003  FALSE    af     0
21    j 2004   TRUE    ot     1
22    j 2004   TRUE    el     1
23    j 2004   TRUE    ip     1
24    j 2004   TRUE    yt     1

Answer 5

所以这个解决方案是纯粹的基础 R。我曾经读过一篇文章，声称使用 . <- 是 %>% 的有效替代品。这是我第一次尝试。我觉得我喜欢

. <- DF[c('id', 'name', 'year')]
. <- merge(., ., by = 'name')
. <- .[.["id.x"] != .["id.y"] & .["year.x"] < .["year.y"],]
. <- .[c('id.x', 'id.y', 'year.x', 'year.y', "name")] 
.$n <- 1
. <- aggregate(n ~ id.x + id.y, data = ., sum) 
. <- .[.['n'] >= 3, 'id.y']

DF$dummy2 <- . == DF$id

Answer 6

在 OP 对速度和内存问题进行编辑后，Rcpp 方法如何：

#create a integer column out of id for non-equi join later
setDT(DF)[, nid := rleid(id)]

#convert name into an integer code
DF[DF[,.(name=unique(name))][, IntCode := .I], iname := IntCode, on=.(name)]

library(inline)
library(Rcpp)
cppFunction('
NumericVector hasOccur(NumericVector nid, NumericVector year, List iname) {
    List namelist(iname);
    int sz = namelist.size(), i, j, m, n, nPrev, nCurr, count;
    NumericVector res(sz);

    for(i=0; i<sz; i++) {
        for(j=0; j<i; j++) {
            if (nid[j] < nid[i] && year[j] < year[i]) {
                SEXP prevList = namelist[j];
                SEXP currList = namelist[i];

                NumericVector cl(currList);
                NumericVector pl(prevList);
                nPrev = pl.size();
                nCurr = cl.size();

                res[i] = 0;
                count = 0;
                for(m=0; m<nCurr; m++) {
                    for (n=0; n<nPrev; n++) {
                        if (cl[m] == pl[n]) {
                            count++;
                            break;
                        }
                    }
                }

                if (count >= 3) {
                    res[i] = 1;
                    break;
                }
            }
        }
    }

    return(res);
}')

d <- DF[, .(.(nm=iname)), by=.(nid, year)]
DF[d[, dummy := hasOccur(d$nid, d$year, d$V1)], dummy := dummy, on=.(nid, year)]

HTH.

另一种可能的data.table方法：

#create a integer column out of id for non-equi join later
setDT(DF)[, nid := rleid(id)]

          #self non-equi join
check3 <- DF[DF, .(x.id, x.name, x.year, x.nid, i.id, i.name, i.year, i.nid), on=.(nid<nid, year<year, name=name)][,
    #count the number of occurrence in previous id and year
    uniqueN(x.name, na.rm=TRUE), by=.(i.id, i.year, x.id, x.year)][,
        #check if more than 3
        any(V1 >= 3L), by=.(i.id, i.year)]

#update join to add result to original DF
DF[check3, dummy := as.integer(V1), on=c("id"="i.id", "year"="i.year")]

以分组观察中的重复为条件的虚拟变量

Dummy variable conditioned on repetitions in grouped observations

r

dummy-variable

编辑

原版post