How do I match a group of observations with a dyad?

Suppose I have a data frame containing a list of names and the companies they are clients of:

name <- c("Anne", "Anne", "Mary", "Mary", "Mary", "Joe", "Joe", "Joe", "David", "David", "David", "David", "David")
company <- c("A", "B", "C", "D", "E", "A", "B", "C", "D", "E", "F", "G", "H")

df1 <- data.frame(name, company)

Then I have a second data frame listing companies that are working together on projects:

company1 <- c("A", "B", "C", "D", "E", "F", "G", "H")
company2 <- c("B", "C", "E", "E", "G", "A", "B", "C")

df2 <- data.frame(company1, company2)

The result I'm hoping for looks like this:

  name      A     B     C     D     E     F     G     No of sets
1 Anne      1     1     0     0     0     0     0     1
2 David     0     0     0     1     1     1     1     1
3 Joe       1     1     1     0     0     0     0     2
4 Mary      0     0     1     1     1     0     0     1

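So this counts the number of "sets" that match a set in df2. For example, Anne has 1 for both A and B, which matches row 1 of df2. Joe has A, B, and C, and both the pair A and B and the pair B and C are rows in df2, so Joe's row has two matches.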

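I think this might work for you. Let me know. It doesn't match your expected result because you didn't include H, which I assume is a typo? Likewise, shouldn't Mary's No_of_sets also equal 2, since both C and E and D and E are rows in df2?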

# Tabulate the frequency of name x company combinations
r <- as.data.frame.matrix(table(df1$name, df1$company))
r
#>       A B C D E F G H
#> Anne  1 1 0 0 0 0 0 0
#> David 0 0 0 1 1 1 1 1
#> Joe   1 1 1 0 0 0 0 0
#> Mary  0 0 1 1 1 0 0 0

# Get "sets" of companies working together
s <- paste(df2$company1, df2$company2)
s
#> [1] "A B" "B C" "C E" "D E" "E G" "F A" "G B" "H C"

# Get all potential company sets associated with each name
m <- apply(r, MARGIN = 1, FUN = function(x) combn(names(which(x==1)), 2))

# Intersect sets of companies potentially working together (m) with
# companies actually working together (df2)
# (You could use a nested apply here, but I thought that it
# would be too opaque. Looping is a little more clear.)
for(name in rownames(r)){
  pairs <- m[[name]]
  ppairs <- apply(pairs, 2, paste0, collapse = " ")
  r[name, "No_of_sets"] <- length(intersect(ppairs, s))
}
r
#>       A B C D E F G H No_of_sets
#> Anne  1 1 0 0 0 0 0 0          1
#> David 0 0 0 1 1 1 1 1          2
#> Joe   1 1 1 0 0 0 0 0          2
#> Mary  0 0 1 1 1 0 0 0          2

Created on 2021-10-19 by the reprex package (v2.0.1)
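
One caveat about the matching above: combn() returns each pair in column order (alphabetical here), while s keeps the order in which each row of df2 was written, so a row such as "F A" could never match the pair "A F". If the dyads are meant to be unordered (an assumption, the question does not say), the keys could be normalized first. A minimal sketch, where s_norm and pair are illustrative names:

```r
# Example dyads from the question, rebuilt so the snippet runs on its own
company1 <- c("A", "B", "C", "D", "E", "F", "G", "H")
company2 <- c("B", "C", "E", "E", "G", "A", "B", "C")

# Original, order-sensitive keys: "F A" keeps the order of the df2 row
s <- paste(company1, company2)

# Assumption: a dyad is unordered, so put the alphabetically smaller
# company first, which is the order combn() yields for sorted input
s_norm <- paste(pmin(company1, company2), pmax(company1, company2))

# A hypothetical client of both A and F only matches the normalized keys
pair <- paste(combn(c("A", "F"), 2), collapse = " ")
pair %in% s
#> [1] FALSE
pair %in% s_norm
#> [1] TRUE
```

With the example data the counts do not change, because no name holds both companies of a reversed row.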

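Edit: Suppose a name may be associated with only one company. In that case, you need to add a condition to account for this in both steps (building the pairs and counting the matches). First, the new data... note that the name "Solo" works with only one company.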

r
#>       A B C D E F G H
#> Anne  1 1 0 0 0 0 0 0
#> David 0 0 0 1 1 1 1 1
#> Joe   1 1 1 0 0 0 0 0
#> Mary  0 0 1 1 1 0 0 0
#> Solo  1 0 0 0 0 0 0 0
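
The reprex doesn't show how the Solo row was added. One possible way, assuming the extra client is simply appended to df1 before tabulating again (this step is a guess, not part of the original code):

```r
# Original example data from the question
name <- c("Anne", "Anne", "Mary", "Mary", "Mary", "Joe", "Joe", "Joe",
          "David", "David", "David", "David", "David")
company <- c("A", "B", "C", "D", "E", "A", "B", "C", "D", "E", "F", "G", "H")
df1 <- data.frame(name, company)

# Assumed step: add one client who works with a single company
df1 <- rbind(df1, data.frame(name = "Solo", company = "A"))

# Re-tabulate name x company as before
r <- as.data.frame.matrix(table(df1$name, df1$company))
r
```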

m <- apply(r, MARGIN = 1, FUN = function(x) {
  own <- names(which(x == 1))
  # combn() needs at least two companies to form a pair
  if(length(own) > 1) combn(own, 2) else own
})
m
#> $Anne
#>      [,1]
#> [1,] "A" 
#> [2,] "B" 
#> 
#> $David
#>      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
#> [1,] "D"  "D"  "D"  "D"  "E"  "E"  "E"  "F"  "F"  "G"  
#> [2,] "E"  "F"  "G"  "H"  "F"  "G"  "H"  "G"  "H"  "H"  
#> 
#> $Joe
#>      [,1] [,2] [,3]
#> [1,] "A"  "A"  "B" 
#> [2,] "B"  "C"  "C" 
#> 
#> $Mary
#>      [,1] [,2] [,3]
#> [1,] "C"  "C"  "D" 
#> [2,] "D"  "E"  "E" 
#> 
#> $Solo
#> [1] "A"

# A name with a single company has no pairs, so only paste when pairs exist
for(name in rownames(r)){
  pairs <- m[[name]]
  if(length(pairs) > 1){
    ppairs <- apply(pairs, 2, paste0, collapse = " ")
  } else ppairs <- pairs
  r[name, "No_of_sets"] <- length(intersect(ppairs, s))
}
r
#>       A B C D E F G H No_of_sets
#> Anne  1 1 0 0 0 0 0 0          1
#> David 0 0 0 1 1 1 1 1          2
#> Joe   1 1 1 0 0 0 0 0          2
#> Mary  0 0 1 1 1 0 0 0          2
#> Solo  1 0 0 0 0 0 0 0          0
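
One more thing to watch: apply() returns a list here only because the names have different numbers of companies. If every name had the same number, apply() would simplify the result to a matrix and m[[name]] would fail. A sketch of a variant that avoids this by working on split(), which always returns a list; count_sets and no_of_sets are hypothetical names, and it keeps the same order-sensitive matching against s as above:

```r
# Example data including the assumed Solo client
df1 <- data.frame(
  name = c("Anne", "Anne", "Mary", "Mary", "Mary", "Joe", "Joe", "Joe",
           "David", "David", "David", "David", "David", "Solo"),
  company = c("A", "B", "C", "D", "E", "A", "B", "C",
              "D", "E", "F", "G", "H", "A"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  company1 = c("A", "B", "C", "D", "E", "F", "G", "H"),
  company2 = c("B", "C", "E", "E", "G", "A", "B", "C"),
  stringsAsFactors = FALSE
)
s <- paste(df2$company1, df2$company2)

# split() always returns a list: one character vector of companies per name
companies <- split(df1$company, df1$name)

# Hypothetical helper: count the company pairs of one name that appear in s
count_sets <- function(own) {
  own <- sort(unique(own))
  if (length(own) < 2) return(0L)  # a single company cannot form a pair
  ppairs <- combn(own, 2, FUN = paste, collapse = " ")
  length(intersect(ppairs, s))
}

no_of_sets <- vapply(companies, count_sets, integer(1))
no_of_sets
```

vapply() with integer(1) also checks that each name yields exactly one count, which makes a malformed result fail early.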