How to filter an adjacency matrix by values above x

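I have a fairly large adjacency matrix and only want to keep the relationships that have at least 5 transactions between them. How would you do that? Does it make sense to assign 0 to all values below 5, or is there a smarter way?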

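That should give me a new adjacency matrix. How can I then output the relationships as a list in which each ID appears together with its associated "partners"?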

Thanks a lot for your help :)!

This is my code for the adjacency matrix so far:

dd <- head(newdata, 50000)
colnames(dd) <- c("MEMBER_ID","AUTHOR_ID")
x <- xtabs(~MEMBER_ID+AUTHOR_ID, dd)
mm <- crossprod(x,x)
mm[lower.tri(mm, TRUE)] <- NA

This is the View() result in RStudio.

This is what I would like to have for each ID pair in my dataset.

For completeness, here is a reproducible sample of my original data, SubsMAIN:

# > dput(head(SubsMAIN, 100))
structure(list(MEMBER_ID = c(199781, 199781, 199781, 199781, 
199781, 199781, 199781, 199781, 199781, 199781, 199781, 199781, 
199781, 199781, 199781, 199781, 199781, 199781, 199781, 199781, 
199781, 199781, 199781, 199781, 199781, 199781, 199781, 199781, 
199781, 199781, 199781, 199781, 199781, 199781, 199781, 199781, 
199781, 199781, 199781, 199781, 199781, 199781, 199781, 199781, 
199781, 199781, 199781, 199781, 199781, 199781, 199781, 199781, 
199781, 199781, 199781, 199781, 199781, 199781, 199781, 199781, 
199781, 199781, 199781, 199781, 199781, 199781, 199781, 199781, 
199781, 199781, 199781, 199781, 199781, 199781, 199781, 199781, 
199781, 199781, 199781, 199781, 199781, 199781, 199781, 199781, 
199781, 199781, 199781, 199781, 199781, 199781, 199781, 199781, 
199781, 199781, 199781, 199781, 199781, 199781, 199781, 199781
), RATING = c(5, 5, 5, 3, 5, 5, 4, 5, 3, 4, 5, 5, 5, 3, 4, 4, 
2, 5, 5, 5, 4, 5, 5, 5, 5, 4, 5, 3, 5, 4, 5, 4, 4, 3, 3, 2, 5, 
3, 5, 4, 5, 5, 5, 5, 5, 4, 5, 5, 5, 4, 5, 5, 4, 4, 5, 5, 5, 3, 
4, 4, 5, 5, 5, 5, 4, 5, 5, 5, 4, 5, 5, 5, 5, 5, 5, 5, 4, 4, 5, 
5, 4, 4, 5, 5, 4, 5, 3, 5, 3, 5, 5, 5, 2, 3, 5, 5, 3, 5, 4, 3
), AUTHOR_ID = c(258195, 201494, 409591, 1964674948, 284187, 
641414, 686042, 531975, 1892323204, 362579, 301950, 2988937092, 
205270, 353623, 657993, 2418118532, 590804, 222936, 216022, 2320404356, 
199862, 538993, 290046, 234885, 417532, 1705021316, 216430, 1320783748, 
301950, 2012450692, 3267006340, 321415, 213839, 1967230852, 519301, 
1880919940, 409850, 617204, 262004, 200165, 3267006340, 345500, 
1711443844, 290046, 238184, 241451, 452301, 301950, 205491, 212098, 
241578, 2367524740, 2366410628, 225252, 2988937092, 1789300612, 
1965068164, 432146, 2151190404, 1772130180, 290046, 203622, 210929, 
243427, 205705, 301950, 2551549828, 2250674052, 1378848644, 298157, 
1873186692, 526355, 231243, 2988937092, 241578, 547653, 1301319556, 
1956417412, 292382, 2571341700, 421709, 2309066628, 256232, 214201, 
447962, 278848, 2533396356, 328874, 1955106692, 262822, 1568706436, 
458913, 217003, 583640, 307259, 199780, 1836027780, 235786, 2366279556, 
358714), STATUS = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), CREATION = c("2001/01/10", 
"2001/01/10", "2001/01/10", "2001/01/10", "2001/01/10", "2001/01/10", 
"2001/01/10", "2001/01/10", "2001/01/10", "2001/01/10", "2001/01/10", 
"2001/01/10", "2001/01/10", "2001/01/10", "2001/01/10", "2001/01/10", 
"2001/01/10", "2001/01/10", "2001/01/10", "2001/01/10", "2001/01/10", 
"2001/01/10", "2001/01/10", "2001/01/10", "2001/01/10", "2001/01/10", 
"2001/01/10", "2001/01/10", "2001/01/10", "2001/01/10", "2001/01/10", 
"2001/01/10", "2001/01/10", "2001/01/10", "2001/01/10", "2001/01/10", 
"2001/01/10", "2001/01/10", "2001/01/10", "2001/01/10", "2001/01/10", 
"2001/01/10", "2001/01/10", "2001/01/10", "2001/01/10", "2001/01/10", 
"2001/01/10", "2001/01/10", "2001/01/10", "2001/01/10", "2001/01/10", 
"2001/01/10", "2001/01/10", "2001/01/10", "2001/01/10", "2001/01/10", 
"2001/01/10", "2001/01/10", "2001/01/10", "2001/01/10", "2001/01/10", 
"2001/01/10", "2001/01/10", "2001/01/10", "2001/01/10", "2001/01/10", 
"2001/01/10", "2001/01/10", "2001/01/10", "2001/01/10", "2001/01/10", 
"2001/01/10", "2001/01/10", "2001/01/10", "2001/01/10", "2001/01/10", 
"2001/01/10", "2001/01/10", "2001/01/10", "2001/01/10", "2001/01/10", 
"2001/01/10", "2001/01/10", "2001/01/10", "2001/01/10", "2001/01/10", 
"2001/01/10", "2001/01/10", "2001/01/10", "2001/01/10", "2001/01/10", 
"2001/01/10", "2001/01/10", "2001/01/10", "2001/01/10", "2001/01/10", 
"2001/01/10", "2001/01/10", "2001/01/10", "2001/01/10"), LAST_MODIFIED = c("2001/03/24", 
"2001/08/25", "2002/12/02", "2001/03/29", "2002/03/22", "2002/04/22", 
"2001/01/22", "2001/11/15", "2001/04/10", "2001/03/24", "2001/04/03", 
"2001/10/11", "2001/05/08", "2001/03/07", "2002/01/26", "2002/03/10", 
"2001/03/24", "2001/03/25", "2001/01/28", "2001/09/06", "2001/05/22", 
"2001/05/03", "2001/01/18", "2001/10/26", "2002/01/09", "2001/08/21", 
"2001/02/09", "2001/03/14", "2002/03/22", "2001/03/19", "2001/02/10", 
"2001/01/19", "2001/02/09", "2001/09/28", "2001/01/19", "2001/01/31", 
"2001/03/19", "2001/01/31", "2001/02/09", "2001/03/07", "2001/08/10", 
"2001/09/29", "2001/07/31", "2001/06/20", "2001/07/03", "2001/09/12", 
"2001/03/30", "2002/05/07", "2002/08/10", "2002/02/23", "2001/09/06", 
"2001/03/19", "2001/10/30", "2001/01/29", "2001/04/28", "2001/11/17", 
"2002/02/23", "2001/03/15", "2001/10/28", "2001/01/31", "2001/06/12", 
"2003/08/06", "2002/01/09", "2001/08/30", "2001/12/22", "2001/08/21", 
"2001/04/16", "2001/11/15", "2002/05/03", "2001/03/15", "2001/08/29", 
"2001/09/12", "2001/11/17", "2001/10/04", "2001/08/20", "2001/08/21", 
"2001/11/17", "2003/08/06", "2001/04/03", "2001/07/22", "2001/02/11", 
"2001/09/12", "2001/07/03", "2001/05/11", "2002/01/09", "2001/03/05", 
"2001/07/10", "2003/06/25", "2001/02/18", "2001/03/27", "2001/06/06", 
"2002/08/11", "2001/04/27", "2001/02/18", "2001/08/22", "2002/02/23", 
"2001/10/30", "2001/07/03", "2001/06/04", "2003/04/28")), row.names = c(NA, 
100L), class = "data.frame")

Using the superior performance of data.table, we can avoid the conversion to and from an adjacency matrix altogether.

Given a SubsMAIN dataset like the one reproduced here

structure(list(MEMBER_ID = c(199781, 199781, 199781, 199781, 199781, 199781,
                             199781, 199781, 199781, 199781, 301950, 199781,
                             199781, 199781, 199781, 199781, 199781, 199781,
                             199781, 199781, 199781, 199781, 199781, 199781,
                             199781, 199781, 199781, 199781, 199781, 199781,
                             199781, 199781, 199781, 199781, 199781, 199781,
                             199781, 199781, 199781, 199781, 199781, 199781,
                             199781, 199781, 199781, 199781, 199781, 199781,
                             199781, 199781, 199781, 199781, 199781, 199781,
                             199781, 199781, 199781, 199781, 199781, 199781,
                             199781, 199781, 199781, 199781, 199781, 199781,
                             199781, 199781, 199781, 199781, 199781, 199781,
                             199781, 199781, 199781, 199781, 199781, 199781,
                             199781, 199781, 199781, 199781, 199781, 199781,
                             199781, 199781, 199781, 199781, 199781, 199781,
                             199781, 199781, 199781, 199781, 199781, 199781,
                             199781, 199781, 199781, 199781),
               RATING = c(5, 5, 5, 3, 5, 5, 4, 5, 3, 4, 5, 5, 5, 3, 4, 4, 2, 5,
                          5, 5, 4, 5, 5, 5, 5, 4, 5, 3, 5, 4, 5, 4, 4, 3, 3, 2,
                          5, 3, 5, 4, 5, 5, 5, 5, 5, 4, 5, 5, 5, 4, 5, 5, 4, 4,
                          5, 5, 5, 3, 4, 4, 5, 5, 5, 5, 4, 5, 5, 5, 4, 5, 5, 5,
                          5, 5, 5, 5, 4, 4, 5, 5, 4, 4, 5, 5, 4, 5, 3, 5, 3, 5,
                          5, 5, 2, 3, 5, 5, 3, 5, 4, 3),
               AUTHOR_ID = c(258195, 201494, 409591, 1964674948, 284187, 641414,
                             686042, 531975, 1892323204, 362579, 199781,
                             2988937092, 205270, 353623, 657993, 2418118532,
                             590804, 222936, 216022, 2320404356, 199862, 538993,
                             290046, 234885, 417532, 1705021316, 216430,
                             1320783748, 301950, 2012450692, 3267006340, 321415,
                             213839, 1967230852, 519301, 1880919940, 409850,
                             617204, 262004, 200165, 3267006340, 345500,
                             1711443844, 290046, 238184, 241451, 452301, 301950,
                             205491, 212098, 241578, 2367524740, 2366410628,
                             225252, 2988937092, 1789300612, 1965068164, 432146,
                             2151190404, 1772130180, 290046, 203622, 210929,
                             243427, 205705, 301950, 2551549828, 2250674052,
                             1378848644, 298157, 1873186692, 526355, 231243,
                             2988937092, 241578, 547653, 1301319556, 1956417412,
                             292382, 2571341700, 421709, 2309066628, 256232,
                             214201, 447962, 278848, 2533396356, 328874,
                             1955106692, 262822, 1568706436, 458913, 217003,
                             583640, 307259, 199780, 1836027780, 235786,
                             2366279556, 358714),
               STATUS = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L,
                          0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
                          0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
                          0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L,
                          0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
                          0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
                          0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
                          0L, 0L),
               CREATION = c("2001/01/10", "2001/01/10", "2001/01/10",
                            "2001/01/10", "2001/01/10", "2001/01/10",
                            "2001/01/10", "2001/01/10", "2001/01/10",
                            "2001/01/10", "2001/01/10", "2001/01/10",
                            "2001/01/10", "2001/01/10", "2001/01/10",
                            "2001/01/10", "2001/01/10", "2001/01/10",
                            "2001/01/10", "2001/01/10", "2001/01/10",
                            "2001/01/10", "2001/01/10", "2001/01/10",
                            "2001/01/10", "2001/01/10", "2001/01/10",
                            "2001/01/10", "2001/01/10", "2001/01/10",
                            "2001/01/10", "2001/01/10", "2001/01/10",
                            "2001/01/10", "2001/01/10", "2001/01/10",
                            "2001/01/10", "2001/01/10", "2001/01/10",
                            "2001/01/10", "2001/01/10", "2001/01/10",
                            "2001/01/10", "2001/01/10", "2001/01/10",
                            "2001/01/10", "2001/01/10", "2001/01/10",
                            "2001/01/10", "2001/01/10", "2001/01/10",
                            "2001/01/10", "2001/01/10", "2001/01/10",
                            "2001/01/10", "2001/01/10", "2001/01/10",
                            "2001/01/10", "2001/01/10", "2001/01/10",
                            "2001/01/10", "2001/01/10", "2001/01/10",
                            "2001/01/10", "2001/01/10", "2001/01/10",
                            "2001/01/10", "2001/01/10", "2001/01/10",
                            "2001/01/10", "2001/01/10", "2001/01/10",
                            "2001/01/10", "2001/01/10", "2001/01/10",
                            "2001/01/10", "2001/01/10", "2001/01/10",
                            "2001/01/10", "2001/01/10", "2001/01/10",
                            "2001/01/10", "2001/01/10", "2001/01/10",
                            "2001/01/10", "2001/01/10", "2001/01/10",
                            "2001/01/10", "2001/01/10", "2001/01/10",
                            "2001/01/10", "2001/01/10", "2001/01/10",
                            "2001/01/10", "2001/01/10", "2001/01/10",
                            "2001/01/10", "2001/01/10", "2001/01/10",
                            "2001/01/10"),
               LAST_MODIFIED = c("2001/03/24", "2001/08/25", "2002/12/02",
                                 "2001/03/29", "2002/03/22", "2002/04/22",
                                 "2001/01/22", "2001/11/15", "2001/04/10",
                                 "2001/03/24", "2001/04/03", "2001/10/11",
                                 "2001/05/08", "2001/03/07", "2002/01/26",
                                 "2002/03/10", "2001/03/24", "2001/03/25",
                                 "2001/01/28", "2001/09/06", "2001/05/22",
                                 "2001/05/03", "2001/01/18", "2001/10/26",
                                 "2002/01/09", "2001/08/21", "2001/02/09",
                                 "2001/03/14", "2002/03/22", "2001/03/19",
                                 "2001/02/10", "2001/01/19", "2001/02/09",
                                 "2001/09/28", "2001/01/19", "2001/01/31",
                                 "2001/03/19", "2001/01/31", "2001/02/09",
                                 "2001/03/07", "2001/08/10", "2001/09/29",
                                 "2001/07/31", "2001/06/20", "2001/07/03",
                                 "2001/09/12", "2001/03/30", "2002/05/07",
                                 "2002/08/10", "2002/02/23", "2001/09/06",
                                 "2001/03/19", "2001/10/30", "2001/01/29",
                                 "2001/04/28", "2001/11/17", "2002/02/23",
                                 "2001/03/15", "2001/10/28", "2001/01/31",
                                 "2001/06/12", "2003/08/06", "2002/01/09",
                                 "2001/08/30", "2001/12/22", "2001/08/21",
                                 "2001/04/16", "2001/11/15", "2002/05/03",
                                 "2001/03/15", "2001/08/29", "2001/09/12",
                                 "2001/11/17", "2001/10/04", "2001/08/20",
                                 "2001/08/21", "2001/11/17", "2003/08/06",
                                 "2001/04/03", "2001/07/22", "2001/02/11",
                                 "2001/09/12", "2001/07/03", "2001/05/11",
                                 "2002/01/09", "2001/03/05", "2001/07/10",
                                 "2003/06/25", "2001/02/18", "2001/03/27",
                                 "2001/06/06", "2002/08/11", "2001/04/27",
                                 "2001/02/18", "2001/08/22", "2002/02/23",
                                 "2001/10/30", "2001/07/03", "2001/06/04",
                                 "2003/04/28")),
          row.names = c(NA, 100L),
          class = "data.frame")

the following data.table solution

library(data.table)


# ...
# Code to generate your dataset 'SubsMAIN'.
# ...


# Set your cutoff for the minimum number of transactions.
x <- 3

# Filter 'SubsMAIN' to only those transactions for pairings that meet the cutoff.
results <- as.data.table(SubsMAIN)[
  # Mark each transaction with a new ID for its pairing of 'MEMBER_ID' with
  # 'AUTHOR_ID'.
  , Pair_ID := .GRP,
    # To make the relationship symmetric, pair by the MAX and MIN of the two
    # original IDs, rather than by their column order.
    by = .(pmax(MEMBER_ID, AUTHOR_ID), pmin(MEMBER_ID, AUTHOR_ID))][
  # Mark each transaction with the tally of all transactions for its pair.
  , Tally := .N, by = Pair_ID][
    # Include only those transactions whose tallies meet the cutoff.
    Tally >= x,
    # Drop the helper 'Tally' column; 'Pair_ID' stays to identify each pairing.
    -c("Tally")]


# View results.
results

should produce results like this:

    MEMBER_ID RATING  AUTHOR_ID STATUS   CREATION LAST_MODIFIED Pair_ID
 1:    301950      5     199781      0 2001/01/10    2001/04/03      11
 2:    199781      5 2988937092      1 2001/01/10    2001/10/11      12
 3:    199781      5     290046      0 2001/01/10    2001/01/18      23
 4:    199781      5     301950      0 2001/01/10    2002/03/22      11
 5:    199781      5     290046      0 2001/01/10    2001/06/20      23
 6:    199781      5     301950      0 2001/01/10    2002/05/07      11
 7:    199781      5 2988937092      1 2001/01/10    2001/04/28      12
 8:    199781      5     290046      0 2001/01/10    2001/06/12      23
 9:    199781      5     301950      0 2001/01/10    2001/08/21      11
10:    199781      5 2988937092      0 2001/01/10    2001/10/04      12

in which every transaction from SubsMAIN is kept, provided it belongs to a pairing (Pair_ID) of MEMBER_ID and AUTHOR_ID that has at least x transactions.
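The question also asks for a list in which each ID appears with its partners. One way to get there, as a minimal sketch rather than part of the solution above, is to collapse the transactions to one row per unordered pair and then split the partners by ID. The table `toy`, its small integer IDs, and the names `ID_low`, `ID_high`, `pair_counts`, `both`, and `partners` are all hypothetical stand-ins chosen for illustration; on real data `toy` would be `as.data.table(SubsMAIN)`.

```r
library(data.table)

# Hypothetical toy stand-in for 'SubsMAIN'; only the two ID columns matter here.
toy <- data.table(
  MEMBER_ID = c(1, 2, 1, 1, 3, 2, 2, 3, 2),
  AUTHOR_ID = c(2, 1, 2, 3, 1, 1, 3, 2, 3)
)

# Cutoff for the minimum number of transactions per pair.
x <- 3

# One row per unordered pair (smaller ID first), with its transaction count,
# keeping only the pairs that meet the cutoff.
pair_counts <- toy[
  , .(Tally = .N),
  by = .(ID_low  = pmin(MEMBER_ID, AUTHOR_ID),
         ID_high = pmax(MEMBER_ID, AUTHOR_ID))
][Tally >= x]

# List every pair in both directions, so each ID shows up with all its partners.
both <- rbind(
  pair_counts[, .(ID = ID_low,  PARTNER = ID_high)],
  pair_counts[, .(ID = ID_high, PARTNER = ID_low)]
)

# Named list: one element per ID, holding that ID's sorted partners.
partners <- lapply(split(both$PARTNER, both$ID), sort)
partners
```

Here `pair_counts` acts as an edge list of the surviving relationships, and `partners` is an ordinary named list, so `partners[["2"]]` returns the partners of ID 2.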

Note

For reference, here are the counts from the Tally column:

    MEMBER_ID RATING  AUTHOR_ID STATUS   CREATION LAST_MODIFIED Pair_ID   # Tally
 1:    301950      5     199781      0 2001/01/10    2001/04/03      11   #     4
 2:    199781      5 2988937092      1 2001/01/10    2001/10/11      12   #     3
 3:    199781      5     290046      0 2001/01/10    2001/01/18      23   #     3
 4:    199781      5     301950      0 2001/01/10    2002/03/22      11   #     4
 5:    199781      5     290046      0 2001/01/10    2001/06/20      23   #     3
 6:    199781      5     301950      0 2001/01/10    2002/05/07      11   #     4
 7:    199781      5 2988937092      1 2001/01/10    2001/04/28      12   #     3
 8:    199781      5     290046      0 2001/01/10    2001/06/12      23   #     3
 9:    199781      5     301950      0 2001/01/10    2001/08/21      11   #     4
10:    199781      5 2988937092      0 2001/01/10    2001/10/04      12   #     3

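Notice how row 1, with a MEMBER_ID of 301950 and an AUTHOR_ID of 199781, is paired with rows 4, 6, and 9, each of which has the opposite: a MEMBER_ID of 199781 and an AUTHOR_ID of 301950. That is, our Pair_ID (here 11) preserves the symmetry you requested.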

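Because the sample SubsMAIN has no pairing with 5 (or more) transactions, I lowered the cutoff to x <- 3. That way, at least some transactions make the cut, and there is some output to show.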

For your full dataset, feel free to change the cutoff to x <- 5 or whatever value you like.
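If you would rather stay with an adjacency matrix, as in the question, setting the sub-threshold cells to 0 (or NA) does work. The surviving pairs can also be read off directly with which(..., arr.ind = TRUE). The sketch below assumes a small symmetric count matrix; the matrix `mm`, the IDs "a", "b", "c", the counts, and the names `cutoff`, `keep`, and `edges` are made up for illustration and do not come from SubsMAIN.

```r
# Hypothetical symmetric count matrix: mm[i, j] = transactions between i and j.
ids <- c("a", "b", "c")
mm <- matrix(c(0, 6, 2,
               6, 0, 5,
               2, 5, 0),
             nrow = 3, byrow = TRUE, dimnames = list(ids, ids))

# Minimum number of transactions a pair needs in order to be kept.
cutoff <- 5

# Positions in the upper triangle that meet the cutoff; restricting to the
# upper triangle reports each unordered pair only once.
keep <- which(mm >= cutoff & upper.tri(mm), arr.ind = TRUE)

# Edge list of the surviving pairs with their counts.
edges <- data.frame(
  ID_1 = rownames(mm)[keep[, "row"]],
  ID_2 = colnames(mm)[keep[, "col"]],
  N    = mm[keep],
  stringsAsFactors = FALSE
)
edges
```

For a large and mostly empty matrix, an edge list like `edges` is usually far smaller than the matrix itself, which is one reason to work from the transactions directly.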