如何使用 R 从具有多列的数据框中计算(共)现矩阵?
How to calculate a (co-)occurrence matrix from a data frame with several columns using R?
我是 R 的新手,目前正在处理具有 32 列和大约 200.000 行的边缘列表形式的协作数据。我想根据国家之间的相互作用创建一个(共)现矩阵。但是,我想通过一个对象的总数来统计交互次数。
期望结果的基本示例
如果在一行中"England"出现了三次而"China"只出现了一次,那么结果应该是下面的矩阵。
England China
England 3 3
China 3 1
可重现的例子
df <- data.frame(ID = c(1,2,3,4),
V1 = c("England", "England", "China", "England"),
V2 = c("Greece", "England", "Greece", "England"),
V32 = c("USA", "China", "Greece", "England"))
因此,当前的示例数据框如下所示:
ID V1 V2 ... V32
1 England Greece USA
2 England England China
3 China Greece Greece
4 England England England
.
.
.
期望的结果
我想按行计算(共)现并独立于顺序来获得一个(共)现矩阵,该矩阵考虑了边循环的低频率(例如英格兰-英格兰),这导致以下结果结果:
China England Greece USA
China 2 2 2 0
England 2 6 1 1
Greece 2 1 3 1
USA 0 1 1 1
到目前为止尝试了什么
我已经使用 igraph
来获得具有共现的邻接矩阵。但是,它计算 - 正如预期的那样 - 相同两个对象的相互作用不超过两次,在某些情况下,我得到的值远低于对象的实际频率 row/publication。
df <- data.frame(ID = c(1,2,3,4),
V1 = c("England", "England", "China", "England"),
V2 = c("Greece", "England", "Greece", "England"),
V32 = c("USA", "China", "Greece", "England"))
# remove ID column
df[1] <- list(NULL)
# calculate co-occurrences and return as dataframe
library(igraph)
library(Matrix)
countrydf <- graph.data.frame(df)
countrydf2 <- as_adjacency_matrix(countrydf, type = "both", edges = FALSE)
countrydf3 <- as.data.frame(as.matrix(forceSymmetric(countrydf2)))
China England Greece USA
China 0 0 1 0
England 0 2 1 0
Greece 1 1 0 0
USA 0 0 0 0
我假设必须有一个简单的解决方案,使用 base
and/or dplyr
和/或 table
and/or reshape2
类似于[1], [2], [3], or [5] but nothing has done the trick so far and I was not able to adjust the code to my needs. I've also tried to use [6] 作为基础,但是,同样的问题也适用于此。
library(tidry)
library(dplyr)
library(stringr)
# collapse observations into one column
df2 <- df %>% unite(concat, V1:V32, sep = ",")
# calculate weights
df3 <- df2$concat %>%
str_split(",") %>%
lapply(function(x){
expand.grid(x,x,x,x, w = length(x), stringsAsFactors = FALSE)
}) %>%
bind_rows
df4 <- apply(df3[, -5], 1, sort) %>%
t %>%
data.frame(stringsAsFactors = FALSE) %>%
mutate(w = df3$w)
如果有人能给我指出正确的方向,我会很高兴。
可能有更好的方法,但请尝试:
library(tidyverse)
df1 <- df %>%
pivot_longer(-ID, names_to = "Category", values_to = "Country") %>%
xtabs(~ID + Country, data = ., sparse = FALSE) %>%
crossprod(., .)
df_diag <- df %>%
pivot_longer(-ID, names_to = "Category", values_to = "Country") %>%
mutate(Country2 = Country) %>%
xtabs(~Country + Country2, data = ., sparse = FALSE) %>%
diag()
diag(df1) <- df_diag
df1
Country China England Greece USA
China 2 2 2 0
England 2 6 1 1
Greece 2 1 3 1
USA 0 1 1 1
这是一种使用 dplyr 和 tidyr 包的方法,整个想法在于创建一个数据框,每个国家按行出现,然后将其加入自身。
library(dplyr)
# Create dataframe sammple
df <- data.frame(ID = c(1,2,3,4),
V1 = c("England", "England", "China", "England"),
V2 = c("Greece", "England", "Greece", "England"),
V32 = c("USA", "China", "Greece", "England"),
stringsAsFactors = FALSE)
# Get the occurance of each country in every row.
row_occurance <-
df %>%
tidyr::gather(key = "identifier", value = "country", -ID) %>%
group_by(ID, country) %>%
count()
row_occurance %>%
# Join row_occurance on itself to simulate the matrix
left_join(row_occurance, by = "ID") %>%
# Get the highest occurance row wise, this to handle when country
# name is repeated within same row
mutate(Occurance = pmax(n.x, n.y)) %>%
# Group by 2 countries
group_by(country.x, country.y) %>%
# Sum the occurance of 2 countries together
summarise(Occurance = sum(Occurance)) %>%
# Spread the data to make it in matrix format
tidyr::spread(key = "country.y", value = "Occurance", fill = 0)
# # A tibble: 4 x 5
# # Groups: country.x [4]
# country.x China England Greece USA
# <chr> <dbl> <dbl> <dbl> <dbl>
# China 2 2 2 0
# England 2 6 1 1
# Greece 2 1 3 1
# USA 0 1 1 1
使用base::table
的选项:
df <- data.frame(ID = c(1,2,3,4),
V1 = c("England", "England", "China", "England"),
V2 = c("Greece", "England", "Greece", "England"),
V3 = c("USA", "China", "Greece", "England"))
#get paired combi and remove those from same country
pairs <- as.data.frame(do.call(rbind,
by(df, df$ID, function(x) t(combn(as.character(x[-1L]), 2L)))))
pairs <- pairs[pairs$V1!=pairs$V2, ]
#repeat data frame with columns swap so that
#upper and lower tri have same numbers and all countries are shown
pairs <- rbind(pairs, data.frame(V1=pairs$V2, V2=pairs$V1))
#tabulate pairs
tab <- table(pairs)
#set diagonals to be the count of countries
cnt <- c(table(unlist(df[-1L])))
diag(tab) <- cnt[names(diag(tab))]
tab
输出:
V2
V1 China England Greece USA
China 2 2 2 0
England 2 6 1 1
Greece 2 1 3 1
USA 0 1 1 1
我是 R 的新手,目前正在处理具有 32 列和大约 200.000 行的边缘列表形式的协作数据。我想根据国家之间的相互作用创建一个(共)现矩阵。但是,我想通过一个对象的总数来统计交互次数。
期望结果的基本示例
如果在一行中"England"出现了三次而"China"只出现了一次,那么结果应该是下面的矩阵。
England China
England 3 3
China 3 1
可重现的例子
df <- data.frame(ID = c(1,2,3,4),
V1 = c("England", "England", "China", "England"),
V2 = c("Greece", "England", "Greece", "England"),
V32 = c("USA", "China", "Greece", "England"))
因此,当前的示例数据框如下所示:
ID V1 V2 ... V32
1 England Greece USA
2 England England China
3 China Greece Greece
4 England England England
.
.
.
期望的结果
我想按行计算(共)现并独立于顺序来获得一个(共)现矩阵,该矩阵考虑了边循环的低频率(例如英格兰-英格兰),这导致以下结果结果:
China England Greece USA
China 2 2 2 0
England 2 6 1 1
Greece 2 1 3 1
USA 0 1 1 1
到目前为止尝试了什么
我已经使用 igraph
来获得具有共现的邻接矩阵。但是,它计算 - 正如预期的那样 - 相同两个对象的相互作用不超过两次,在某些情况下,我得到的值远低于对象的实际频率 row/publication。
df <- data.frame(ID = c(1,2,3,4),
V1 = c("England", "England", "China", "England"),
V2 = c("Greece", "England", "Greece", "England"),
V32 = c("USA", "China", "Greece", "England"))
# remove ID column
df[1] <- list(NULL)
# calculate co-occurrences and return as dataframe
library(igraph)
library(Matrix)
countrydf <- graph.data.frame(df)
countrydf2 <- as_adjacency_matrix(countrydf, type = "both", edges = FALSE)
countrydf3 <- as.data.frame(as.matrix(forceSymmetric(countrydf2)))
China England Greece USA
China 0 0 1 0
England 0 2 1 0
Greece 1 1 0 0
USA 0 0 0 0
我假设必须有一个简单的解决方案,使用 base
and/or dplyr
和/或 table
and/or reshape2
类似于[1], [2], [3],
library(tidry)
library(dplyr)
library(stringr)
# collapse observations into one column
df2 <- df %>% unite(concat, V1:V32, sep = ",")
# calculate weights
df3 <- df2$concat %>%
str_split(",") %>%
lapply(function(x){
expand.grid(x,x,x,x, w = length(x), stringsAsFactors = FALSE)
}) %>%
bind_rows
df4 <- apply(df3[, -5], 1, sort) %>%
t %>%
data.frame(stringsAsFactors = FALSE) %>%
mutate(w = df3$w)
如果有人能给我指出正确的方向,我会很高兴。
可能有更好的方法,但请尝试:
library(tidyverse)
df1 <- df %>%
pivot_longer(-ID, names_to = "Category", values_to = "Country") %>%
xtabs(~ID + Country, data = ., sparse = FALSE) %>%
crossprod(., .)
df_diag <- df %>%
pivot_longer(-ID, names_to = "Category", values_to = "Country") %>%
mutate(Country2 = Country) %>%
xtabs(~Country + Country2, data = ., sparse = FALSE) %>%
diag()
diag(df1) <- df_diag
df1
Country China England Greece USA
China 2 2 2 0
England 2 6 1 1
Greece 2 1 3 1
USA 0 1 1 1
这是一种使用 dplyr 和 tidyr 包的方法,整个想法在于创建一个数据框,每个国家按行出现,然后将其加入自身。
library(dplyr)
# Create dataframe sammple
df <- data.frame(ID = c(1,2,3,4),
V1 = c("England", "England", "China", "England"),
V2 = c("Greece", "England", "Greece", "England"),
V32 = c("USA", "China", "Greece", "England"),
stringsAsFactors = FALSE)
# Get the occurance of each country in every row.
row_occurance <-
df %>%
tidyr::gather(key = "identifier", value = "country", -ID) %>%
group_by(ID, country) %>%
count()
row_occurance %>%
# Join row_occurance on itself to simulate the matrix
left_join(row_occurance, by = "ID") %>%
# Get the highest occurance row wise, this to handle when country
# name is repeated within same row
mutate(Occurance = pmax(n.x, n.y)) %>%
# Group by 2 countries
group_by(country.x, country.y) %>%
# Sum the occurance of 2 countries together
summarise(Occurance = sum(Occurance)) %>%
# Spread the data to make it in matrix format
tidyr::spread(key = "country.y", value = "Occurance", fill = 0)
# # A tibble: 4 x 5
# # Groups: country.x [4]
# country.x China England Greece USA
# <chr> <dbl> <dbl> <dbl> <dbl>
# China 2 2 2 0
# England 2 6 1 1
# Greece 2 1 3 1
# USA 0 1 1 1
使用base::table
的选项:
df <- data.frame(ID = c(1,2,3,4),
V1 = c("England", "England", "China", "England"),
V2 = c("Greece", "England", "Greece", "England"),
V3 = c("USA", "China", "Greece", "England"))
#get paired combi and remove those from same country
pairs <- as.data.frame(do.call(rbind,
by(df, df$ID, function(x) t(combn(as.character(x[-1L]), 2L)))))
pairs <- pairs[pairs$V1!=pairs$V2, ]
#repeat data frame with columns swap so that
#upper and lower tri have same numbers and all countries are shown
pairs <- rbind(pairs, data.frame(V1=pairs$V2, V2=pairs$V1))
#tabulate pairs
tab <- table(pairs)
#set diagonals to be the count of countries
cnt <- c(table(unlist(df[-1L])))
diag(tab) <- cnt[names(diag(tab))]
tab
输出:
V2
V1 China England Greece USA
China 2 2 2 0
England 2 6 1 1
Greece 2 1 3 1
USA 0 1 1 1