Find most common combination between pairs

I have a list of events and the guests who attended them. It looks like this, but the real file is much larger:

event       guests
birthday    John Doe
birthday    Jane Doe
birthday    Mark White
wedding     John Doe
wedding     Jane Doe
wedding     Matthew Green
bar mitzvah Janet Black
bar mitzvah John Doe
bar mitzvah Jane Doe
bar mitzvah William Hill
retirement  Janet Black
retirement  Matthew Green

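I want to find the pair of guests who attended the most events together. In this example the answer should be John Doe and Jane Doe, because they both attended the same three events. The output should be a list of these pairs.

Where do I start?

Based on your phrase "attend the most events together", I assume that by similarity you mean the size of the intersection of the events two guests attended.

You can compute that intersection for every pair of names with the following code: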

# All names that we have
nameAll <- unique(df$guests)
# Length of names vector
N <- length(nameAll)

# Function to find intersect between names
getSimilarity <- function(nameA, nameB, type = "intersect") {
    # Subset events for name A
    eventA <- subset(df, guests == nameA)$event
    # Subset events for name B
    eventB <- subset(df, guests == nameB)$event
    # Find intersect length between events
    if (type == "intersect") {
        res <- length(intersect(eventA, eventB))
    }
    # Find Jaccard index between events
    if (type == "JC") {
        res <- length(intersect(eventA, eventB)) / length(union(eventA, eventB))
    }
    # Return result
    return(data.frame(type, value = res, nameA, nameB))
}

# Iterate over all possible combinations
# Using double loop for simpler representation    
result <- list()
for(i in 1:(N-1)) {
    for(j in (i+1):N) {
        result[[length(result) + 1]] <- getSimilarity(nameAll[i], nameAll[j])
    }
}
# Transform result to data.frame and order by similarity 
result <- do.call(rbind, result)
# Showing top 5 pairs (head() returns 6 rows unless n is given)
head(result[with(result, order(-value)), ], 5)
       type value    nameA         nameB
1 intersect     3 John Doe      Jane Doe
2 intersect     1 John Doe    Mark White
3 intersect     1 John Doe Matthew Green
4 intersect     1 John Doe   Janet Black
5 intersect     1 John Doe  William Hill

The Jaccard index (type = "JC") gives the same top pair, although the ranking of the remaining pairs differs:

   type     value       nameA        nameB
1    JC 1.0000000    John Doe     Jane Doe
15   JC 0.5000000 Janet Black William Hill
2    JC 0.3333333    John Doe   Mark White
5    JC 0.3333333    John Doe William Hill
6    JC 0.3333333    Jane Doe   Mark White

Data (df):

df <- structure(list(event = c("birthday", "birthday", "birthday", 
"wedding", "wedding", "wedding", "bar mitzvah", "bar mitzvah", 
"bar mitzvah", "bar mitzvah", "retirement", "retirement"), guests = c("John Doe", 
"Jane Doe", "Mark White", "John Doe", "Jane Doe", "Matthew Green", 
"Janet Black", "John Doe", "Jane Doe", "William Hill", "Janet Black", 
"Matthew Green")), .Names = c("event", "guests"), row.names = c(NA, 
-12L), class = "data.frame")

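A slightly different approach, from a social networks / matrix algebra perspective:

Your data describe ties between individuals through shared membership. This is an affiliation matrix, and we can compute the matrix of ties between individuals $i$ and $j$ as follows: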

# Load as a data frame
df <- data.frame(event = c(rep("birthday", 3),
                           rep("wedding", 3),
                           rep("bar mitzvah", 4),
                           rep("retirement", 2)),
                 guests = c("John Doe", "Jane Doe", "Mark White",
                            "John Doe", "Jane Doe", "Matthew Green",
                            "Janet Black", "John Doe", "Jane Doe",
                            "William Hill", "Janet Black", "Matthew Green"))

# You can represent who attended which event as a matrix
M <- table(df$guests, df$event)
# Now we can compute how many times each individual appeared at an
# event with another with a simple matrix product
admat <- M %*% t(M)
admat


#               Jane Doe Janet Black John Doe Mark White Matthew Green William Hill
# Jane Doe             3           1        3          1             1            1
# Janet Black          1           2        1          0             1            1
# John Doe             3           1        3          1             1            1
# Mark White           1           0        1          1             0            0
# Matthew Green        1           1        1          0             2            0
# William Hill         1           1        1          0             0            1
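
To see why the matrix product counts shared events, it may help to write it out entry by entry. The notation here is my own: $M_{ik}$ is 1 if guest $i$ attended event $k$ and 0 otherwise, and $A$ stands for admat.

```latex
A_{ij} = (M M^{\top})_{ij} = \sum_{k} M_{ik} M_{jk},
\qquad
A_{ii} = \sum_{k} M_{ik}^{2} = \sum_{k} M_{ik}
```

Each term $M_{ik} M_{jk}$ is 1 only when both guests attended event $k$, so the sum is the number of events they share. For John Doe and Jane Doe the non-zero terms are birthday, wedding and bar mitzvah, which gives 3. The diagonal entry $A_{ii}$ is simply the number of events guest $i$ attended.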

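Now we want to zero out the diagonal of the matrix (which tells us how many events each person attended) and one of the two triangles, which holds redundant information because the matrix is symmetric.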

diag(admat) <- 0
admat[upper.tri(admat)] <- 0

Now we just want to convert this into a format you might prefer. I will use the melt function from the reshape2 package.

library(reshape2)
dfmatches <- unique(melt(admat))
# Drop all the zero matches
dfmatches <- dfmatches[dfmatches$value !=0,]
# order it descending
dfmatches <- dfmatches[order(-dfmatches$value),]
dfmatches

#            Var1        Var2 value
#3       John Doe    Jane Doe     3
#2    Janet Black    Jane Doe     1
#4     Mark White    Jane Doe     1
#5  Matthew Green    Jane Doe     1
#6   William Hill    Jane Doe     1
#9       John Doe Janet Black     1
#11 Matthew Green Janet Black     1
#12  William Hill Janet Black     1
#16    Mark White    John Doe     1
#17 Matthew Green    John Doe     1
#18  William Hill    John Doe     1

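Obviously, you can tidy up the output by renaming the variables of interest and so on.

This general approach, by which I mean recognising that your data describe a social network, may be of interest for further analysis (for example, two people may be meaningfully connected if they attend many parties with the same people, even if never with each other). If your dataset is really large, you can make the matrix algebra somewhat faster by using sparse matrices, or by loading the igraph package and using its functions to declare a social network.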
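
As a rough sketch of the sparse-matrix route, the following uses the Matrix package (shipped with R as a recommended package). This is my own illustration rather than code from the answer above, and the names g, e, A and pairs_sparse are made up for it. It stores only the non-zero attendance entries and never builds the dense guests-by-guests matrix.

```r
library(Matrix)

# Example data, rebuilt here so the block is self-contained
df <- data.frame(
  event = c(rep("birthday", 3), rep("wedding", 3),
            rep("bar mitzvah", 4), rep("retirement", 2)),
  guests = c("John Doe", "Jane Doe", "Mark White",
             "John Doe", "Jane Doe", "Matthew Green",
             "Janet Black", "John Doe", "Jane Doe",
             "William Hill", "Janet Black", "Matthew Green"),
  stringsAsFactors = FALSE
)

g <- factor(df$guests)
e <- factor(df$event)

# Sparse guests-by-events matrix: 1 where a guest attended an event
M <- sparseMatrix(i = as.integer(g), j = as.integer(e), x = 1,
                  dims = c(nlevels(g), nlevels(e)))

# Co-attendance counts, M %*% t(M), computed on the sparse matrix
A <- tcrossprod(M)

# Triplet (i, j, x) view of the stored non-zero entries
s <- summary(A)
lo <- pmin(s$i, s$j)
hi <- pmax(s$i, s$j)
keep <- lo < hi  # drop the diagonal

# unique() guards against both triangles being listed
pairs_sparse <- unique(data.frame(
  guest1 = levels(g)[lo[keep]],
  guest2 = levels(g)[hi[keep]],
  together = s$x[keep],
  stringsAsFactors = FALSE
))
pairs_sparse <- pairs_sparse[order(-pairs_sparse$together), ]
head(pairs_sparse)
```

Whether this is actually faster depends on how sparse the attendance data are; with many guests who each attend only a few events, the saving in memory is usually the bigger benefit.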

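I think the answers here are great. I just want to share some thoughts. If you are working with a large dataset, with many guests or many events, several situations are possible. For example, more than two guests may all have attended the most events in common, or two groups of guests may have attended different sets of events but the same number of them. In those cases, finding the top two guests may not be enough.

Here I would like to demonstrate using hierarchical clustering to find similar guests or groups.

We can first construct a matrix of 1s and 0s, where 1 means the guest attended the event and 0 means they did not.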

library(tidyverse)
library(vegan)

# `dat` is assumed to be the example data frame (called `df` in the other answers)
dat_m <- dat %>%
  mutate(value = 1) %>%
  spread(event, value, fill = 0) %>%
  column_to_rownames(var = "guests") %>%
  as.matrix()

dat_m
#               bar mitzvah birthday retirement wedding
# Jane Doe                1        1          0       1
# Janet Black             1        0          1       0
# John Doe                1        1          0       1
# Mark White              0        1          0       0
# Matthew Green           0        0          1       1
# William Hill            1        0          0       0

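Then we can compute the distance between each pair of guests. Note that I used the vegdist function from the vegan package (its default method is Bray-Curtis) and set binary = TRUE because we are dealing with binary data.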

dat_dist <- vegdist(dat_m, binary = TRUE)

dat_dist
#                Jane Doe Janet Black  John Doe Mark White Matthew Green
# Janet Black   0.6000000                                               
# John Doe      0.0000000   0.6000000                                   
# Mark White    0.5000000   1.0000000 0.5000000                         
# Matthew Green 0.6000000   0.5000000 0.6000000  1.0000000              
# William Hill  0.5000000   0.3333333 0.5000000  1.0000000     1.0000000

Then we can run the hierarchical clustering and look at the result.

hc <- hclust(dat_dist)
plot(hc)

According to the dendrogram, Jane Doe and John Doe are the most similar, and as a group they differ the most from the other groups.
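
If you would rather read the groups off the tree than off the plot, cutree() can do that. The sketch below is an assumption-laden stand-in for the code above: it uses only base R, with dist(method = "binary") (a Jaccard-type distance) swapped in for vegdist, and the cut height of 0.1 is an arbitrary value chosen to keep only guests with identical attendance together. The names m, groups and shared are made up for this example.

```r
# Example data, rebuilt here so the block is self-contained
df <- data.frame(
  event = c(rep("birthday", 3), rep("wedding", 3),
            rep("bar mitzvah", 4), rep("retirement", 2)),
  guests = c("John Doe", "Jane Doe", "Mark White",
             "John Doe", "Jane Doe", "Matthew Green",
             "Janet Black", "John Doe", "Jane Doe",
             "William Hill", "Janet Black", "Matthew Green"),
  stringsAsFactors = FALSE
)

# 0/1 attendance matrix: guests in rows, events in columns
m <- unclass(table(df$guests, df$event))

# Binary (Jaccard-type) distance between guests, base R only
d <- dist(m, method = "binary")
hc <- hclust(d)

# Cut the tree just above height 0 (0.1 is an arbitrary small threshold)
groups <- cutree(hc, h = 0.1)

# Keep only the groups that contain more than one guest
shared <- split(names(groups), groups)[table(groups) > 1]
shared
```

Raising the cut height merges looser groups, so on a larger dataset it can be tuned to return clusters of guests who attended mostly, rather than exactly, the same events.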

We can also check that Jane Doe and John Doe attended the most events, so we know we can select these two.

rowSums(dat_m)
# Jane Doe   Janet Black      John Doe    Mark White Matthew Green  William Hill 
#        3             2             3             1             2             1 

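Again, I think the other answers are more direct and give you the output for this example dataset, but if you are working with a larger dataset, hierarchical clustering may be an option.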