Mapping a dataframe (with NA) to an n by n adjacency matrix (as a data.frame object)

Question

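I have a three-column dataframe object recording bilateral trade data among 161 countries. The data are in dyadic format, with 19687 rows and three columns: reporter (rid), partner (pid), and their bilateral trade flow in a given year (TradeValue). rid and pid take values from 1 to 161, and a country is assigned the same rid and pid. For any given (rid, pid) pair with rid =/= pid, TradeValue(rid, pid) = TradeValue(pid, rid).

The data (as displayed in R) look like this: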

#load the data from dropbox folder
library(foreign)
example_data <- read.csv("https://www.dropbox.com/s/hf0ga22tdjlvdvr/example_data.csv?dl=1")

head(example_data, n = 10)
   rid pid TradeValue
1    2   3        500
2    2   7       2328
3    2   8    2233465
4    2   9      81470
5    2  12     572893
6    2  17     488374
7    2  19    3314932
8    2  23      20323
9    2  25         10
10   2  29    9026220

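The data come from the UN Comtrade database. Each rid is paired with multiple pid values to give their bilateral trade data. As can be seen, however, not every pid has a numeric id value, because I only assigned an rid or pid to a country if a list of relevant economic indicators is available for it. That is why there are NAs in pid even though TradeValue data exist between that country and the reporting country (rid). The same applies when such a country would be the "reporter": it did not report any TradeValue with partners, so its id number does not appear in the rid column. (This is why the rid column starts at 2: country 1, i.e. Afghanistan, did not report any bilateral trade data with partners.) A quick check of summary statistics helps confirm this: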

length(unique(example_data$rid))
[1] 139
# only 139 countries reported bilateral trade statistics with partners
length(unique(example_data$pid))
[1] 162
# that extra pid is NA (161 + NA = 162)
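
To see exactly which ids are affected, a check along the following lines may help. This is only a minimal sketch: `toy` is a made-up stand-in for example_data (the Dropbox file may not be reachable), and `n_countries`, `missing_rid`, `n_na_pid` and `toy_clean` are hypothetical names introduced for illustration.

```r
# Hypothetical toy data standing in for example_data (values are made up)
toy <- data.frame(
  rid = c(2, 2, 3, 3),
  pid = c(3, NA, 2, 4),
  TradeValue = c(10, 5, 10, 7)
)
n_countries <- 4  # assumed size of the full id list (161 in the real data)

# ids that never appear as a reporter
missing_rid <- setdiff(seq_len(n_countries), toy$rid)

# number of rows whose partner has no id
n_na_pid <- sum(is.na(toy$pid))

# rows that can actually be placed in an n x n matrix
toy_clean <- toy[!is.na(toy$pid), ]
```

Rows with an NA pid cannot be mapped to any cell of the matrix, so it seems reasonable to drop them before reshaping.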

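Most countries report bilateral trade data with their partners, and those that do not tend to be small economies. I therefore want to keep the full list of 161 countries and transform the example_data dataframe into a 161 x 161 adjacency matrix in which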

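for countries absent from the rid column (e.g., rid == 1), a row is created for each and the entire row (in the 161 x 161 matrix) is set to 0;
for countries (pid) that share no TradeValue entry with a particular rid, the corresponding cells are set to 0.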

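For example, suppose that in a 5 x 5 adjacency matrix, country 1 did not report any trade statistics with partners, while the other four countries reported bilateral trade statistics with one another (but not with country 1). The original dataframe would look like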

rid pid TradeValue
2 3 223
2 4 13
2 5 9
3 2 223
3 4 57
3 5 28
4 2 13
4 3 57
4 5 82
5 2 9
5 3 28
5 4 82

I want to transform it into a 5 x 5 adjacency matrix (in data.frame format), and the desired output should look like this:

  V1  V2  V3 V4 V5
1  0   0   0  0  0
2  0   0 223 13  9
3  0 223   0 57 28
4  0  13  57  0 82
5  0   9  28 82  0

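I then want to use the same method on example_data to create a 161 x 161 adjacency matrix. However, after several rounds of trial and error with reshape and other approaches, I still cannot work out this transformation, and have not even got past the first step.

I would really appreciate it if anyone could enlighten me.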

Answer 1

I could not read the Dropbox file, but I have tried this with your 5-country example dataframe:

library(reshape2)  # provides dcast()

country_num = 5

# check countries missing in rid and pid
rid_miss = setdiff(1:country_num, example_data$rid)
pid_miss = setdiff(1:country_num, example_data$pid)

# create dummy dataframe with missing rid and pid; each missing id is paired
# with itself, so a dummy row cannot duplicate a real (rid, pid) pair
id_miss = union(rid_miss, pid_miss)
add_data = data.frame(id_miss, id_miss, rep(NA_real_, length(id_miss)))
colnames(add_data) = colnames(example_data)

# add dummy dataframe to original
example_data = rbind(example_data, add_data)

# the dcast now takes missing rid and pid into account
mat = dcast(example_data, rid ~ pid, value.var = "TradeValue")

# can remove first column without setting rownames but this is more failproof
rownames(mat) = mat[, 1]
mat = as.matrix(mat[, -1])

# fill in upper triangular matrix with missing values of lower triangular matrix 
# and vice-versa since TradeValue(rid, pid) = TradeValue(pid, rid)
mat[is.na(mat)] = t(mat)[is.na(mat)]

# change NAs to 0 according to preference - would keep as NA to differentiate 
# from actual zeros
mat[is.na(mat)] = 0

Does this help?
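
As a cross-check, the same 5 x 5 result can probably also be reached without reshaping, using base R matrix indexing. This is an alternative technique, not the dcast approach, and only a sketch under the assumption that every non-NA rid and pid is an integer between 1 and `country_num`. The names `clean`, `adj` and `adj_df` are hypothetical.

```r
# 5-country example from the question
example_data <- data.frame(
  rid = c(2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5),
  pid = c(3, 4, 5, 2, 4, 5, 2, 3, 5, 2, 3, 4),
  TradeValue = c(223, 13, 9, 223, 57, 28, 13, 57, 82, 9, 28, 82)
)
country_num <- 5

# drop rows without a usable id (none here, but pid has NAs in the full data)
clean <- example_data[!is.na(example_data$rid) & !is.na(example_data$pid), ]

# start from all zeros, so unreported rows and unreported pairs stay 0
adj <- matrix(0, nrow = country_num, ncol = country_num)

# a two-column index matrix addresses one (row, column) cell per data row
adj[cbind(clean$rid, clean$pid)] <- clean$TradeValue

# convert to a data.frame; the columns get the default names V1..V5
adj_df <- as.data.frame(adj)
```

Unlike the transpose step in the dcast approach, this sketch does not copy a value from (pid, rid) into a missing (rid, pid) cell; whether that fill is wanted depends on how non-reporting countries should be treated.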
