Finding co-occurrences between two variables within a group
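I would like to compute a co-occurrence matrix efficiently by finding co-occurrences between two different variables within a group, ideally without a complicated loop over all possible combinations.
Given a data frame like the following: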
df = data.frame(group = c(1,1,1,2,2,2),var1 = c(1,2,4,2,2,4),var2 = c(4,1,2,1,3,2))
> df
  group var1 var2
1     1    1    4
2     1    2    1
3     1    4    2
4     2    2    1
5     2    2    3
6     2    4    2
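I would like to turn this into a new co-occurrence matrix, where rows represent var1 and columns represent var2.
Edit: for those unfamiliar with co-occurrence, I am interested in pairs of values that occur together within a group. For example, the combination of "2" and "1" occurs once in group 1 and once more in group 2, which gives 2 co-occurrences. In my example I have placed the paired values next to each other, but they can appear anywhere within a group.
It should look like this: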
> cooc
  1 2 3 4
1 0 2 0 1
2 2 0 1 2
3 0 1 0 0
4 1 2 0 0
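I have done this before with the xtabs function for co-occurrences of a single variable within groups, but I am not sure how to apply it to multiple columns. For example, if I were interested in the co-occurrences of var1 values across groups, I would do the following: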
> td = xtabs(~group + var1,data = df)
> cooc = crossprod(td,td)
> diag(cooc) = 0
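To make the single-variable step concrete, here is a minimal sketch of what the crossprod call computes on the example data. The name `cooc_var1` is introduced here only for illustration.

```r
# Example data from the question
df <- data.frame(group = c(1, 1, 1, 2, 2, 2),
                 var1  = c(1, 2, 4, 2, 2, 4),
                 var2  = c(4, 1, 2, 1, 3, 2))

# One row per group, one column per distinct var1 value;
# each cell counts how often that value appears in that group
td <- xtabs(~ group + var1, data = df)

# crossprod(td) is t(td) %*% td, so entry [a, b] sums, over groups,
# (count of a in the group) * (count of b in the group)
cooc_var1 <- crossprod(td)

# A value trivially co-occurs with itself, so clear the diagonal
diag(cooc_var1) <- 0
cooc_var1
```

Note that repeated values are multiplied rather than counted once: 2 and 4 get a count of 3 here, because group 1 contributes 1 × 1 and group 2 (where 2 appears twice) contributes 2 × 1.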
If I understood your question correctly, I believe this should work:
# data.table would only be needed to do this "by group"; this solution
# does not use it because I did not see the significance of grouping
###library(data.table)
###df <- data.table(df)
# create the pair of values "a_b" for each row
df$ID <- paste(df$var1, df$var2, sep = "_")
# enumerate all the unique values so that we can build a map of every
# possible pair to match the data against
uniqval <- sort(unique(c(df$var1, df$var2)))
grid <- expand.grid(uniqval, uniqval)
grid$ID <- paste(grid$Var1, grid$Var2, sep = "_")
# match our data to this map
matches <- sort(match(df$ID, grid$ID))
# tabulate our results into a data frame
tab <- data.frame(table(grid$ID[matches]))
# split the ID back into its two values; splitting on "_" (rather than
# taking fixed character positions) also handles multi-digit values
parts <- strsplit(as.character(tab$Var1), "_", fixed = TRUE)
tab$Var1 <- sapply(parts, `[`, 1)
tab$Var2 <- sapply(parts, `[`, 2)
# create our empty result matrix
cooc <- matrix(0, nrow = length(uniqval), ncol = length(uniqval))
rownames(cooc) <- uniqval
colnames(cooc) <- uniqval
# there are other ways to do this, but this loop seemed simple enough:
# we just need to put the tabulation results into the desired location
# in the matrix, namely "a_b" frequencies into [a,b] and [b,a]
for (m in seq_len(nrow(tab))) {
  # i and j are character values, so they index the matrix by name
  i <- tab$Var1[m]
  j <- tab$Var2[m]
  # adding to the previous value accounts for "a_b" being equivalent to "b_a"
  cooc[i, j] <- cooc[i, j] + tab$Freq[m]
  cooc[j, i] <- cooc[i, j]
}
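The loop above can probably be avoided altogether. As a sketch that is not part of the original answer, and assuming a co-occurrence means the var1 and var2 values found on the same row, with "a_b" and "b_a" treated as the same pair, base R's table() can count the pairs directly once both columns share the same factor levels. The names `lev` and `counts` are introduced here for illustration.

```r
# Example data from the question
df <- data.frame(group = c(1, 1, 1, 2, 2, 2),
                 var1  = c(1, 2, 4, 2, 2, 4),
                 var2  = c(4, 1, 2, 1, 3, 2))

# Shared levels keep the table square, even for values (such as 3)
# that appear in only one of the two columns
lev <- sort(unique(c(df$var1, df$var2)))

# Count every (var1, var2) pair that occurs on a row
counts <- table(factor(df$var1, levels = lev),
                factor(df$var2, levels = lev))

# Convert to a plain numeric matrix labelled by the values
cooc <- matrix(as.vector(counts), nrow = length(lev),
               dimnames = list(as.character(lev), as.character(lev)))

# Adding the transpose merges "a_b" with "b_a"; self-pairs are then
# dropped to match the zero diagonal in the desired output
cooc <- cooc + t(cooc)
diag(cooc) <- 0
cooc
```

Because the group column is never used, this only fits the row-wise reading of the question; if values should be paired across different rows of the same group, a different construction would be needed.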