How to filter an adjacency matrix by values above x
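I have a fairly large adjacency matrix and only want to keep the relationships that have at least 5 transactions with each other. How would you go about it? Does it make sense to assign 0 to all values below 5, or is there a smarter way?
I should then end up with a new adjacency matrix. How can I then output the relationships as a list, in which each ID is shown together with its associated "partners"?
Thank you very much for your help :)!
This is my code for the adjacency matrix so far: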
dd <- head(newdata, 50000)
colnames(dd) <- c("MEMBER_ID","AUTHOR_ID")
x <- xtabs(~MEMBER_ID+AUTHOR_ID, dd)
mm <- crossprod(x,x)
mm[lower.tri(mm, TRUE)] <- NA
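This is the View() result in RStudio.
This is what I would like to have for every pair of IDs in my dataset.
For completeness, here is a reproducible sample of my original data, SubsMAIN: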
# > dput(head(SubsMAIN, 100))
structure(list(MEMBER_ID = c(199781, 199781, 199781, 199781,
199781, 199781, 199781, 199781, 199781, 199781, 199781, 199781,
199781, 199781, 199781, 199781, 199781, 199781, 199781, 199781,
199781, 199781, 199781, 199781, 199781, 199781, 199781, 199781,
199781, 199781, 199781, 199781, 199781, 199781, 199781, 199781,
199781, 199781, 199781, 199781, 199781, 199781, 199781, 199781,
199781, 199781, 199781, 199781, 199781, 199781, 199781, 199781,
199781, 199781, 199781, 199781, 199781, 199781, 199781, 199781,
199781, 199781, 199781, 199781, 199781, 199781, 199781, 199781,
199781, 199781, 199781, 199781, 199781, 199781, 199781, 199781,
199781, 199781, 199781, 199781, 199781, 199781, 199781, 199781,
199781, 199781, 199781, 199781, 199781, 199781, 199781, 199781,
199781, 199781, 199781, 199781, 199781, 199781, 199781, 199781
), RATING = c(5, 5, 5, 3, 5, 5, 4, 5, 3, 4, 5, 5, 5, 3, 4, 4,
2, 5, 5, 5, 4, 5, 5, 5, 5, 4, 5, 3, 5, 4, 5, 4, 4, 3, 3, 2, 5,
3, 5, 4, 5, 5, 5, 5, 5, 4, 5, 5, 5, 4, 5, 5, 4, 4, 5, 5, 5, 3,
4, 4, 5, 5, 5, 5, 4, 5, 5, 5, 4, 5, 5, 5, 5, 5, 5, 5, 4, 4, 5,
5, 4, 4, 5, 5, 4, 5, 3, 5, 3, 5, 5, 5, 2, 3, 5, 5, 3, 5, 4, 3
), AUTHOR_ID = c(258195, 201494, 409591, 1964674948, 284187,
641414, 686042, 531975, 1892323204, 362579, 301950, 2988937092,
205270, 353623, 657993, 2418118532, 590804, 222936, 216022, 2320404356,
199862, 538993, 290046, 234885, 417532, 1705021316, 216430, 1320783748,
301950, 2012450692, 3267006340, 321415, 213839, 1967230852, 519301,
1880919940, 409850, 617204, 262004, 200165, 3267006340, 345500,
1711443844, 290046, 238184, 241451, 452301, 301950, 205491, 212098,
241578, 2367524740, 2366410628, 225252, 2988937092, 1789300612,
1965068164, 432146, 2151190404, 1772130180, 290046, 203622, 210929,
243427, 205705, 301950, 2551549828, 2250674052, 1378848644, 298157,
1873186692, 526355, 231243, 2988937092, 241578, 547653, 1301319556,
1956417412, 292382, 2571341700, 421709, 2309066628, 256232, 214201,
447962, 278848, 2533396356, 328874, 1955106692, 262822, 1568706436,
458913, 217003, 583640, 307259, 199780, 1836027780, 235786, 2366279556,
358714), STATUS = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), CREATION = c("2001/01/10",
"2001/01/10", "2001/01/10", "2001/01/10", "2001/01/10", "2001/01/10",
"2001/01/10", "2001/01/10", "2001/01/10", "2001/01/10", "2001/01/10",
"2001/01/10", "2001/01/10", "2001/01/10", "2001/01/10", "2001/01/10",
"2001/01/10", "2001/01/10", "2001/01/10", "2001/01/10", "2001/01/10",
"2001/01/10", "2001/01/10", "2001/01/10", "2001/01/10", "2001/01/10",
"2001/01/10", "2001/01/10", "2001/01/10", "2001/01/10", "2001/01/10",
"2001/01/10", "2001/01/10", "2001/01/10", "2001/01/10", "2001/01/10",
"2001/01/10", "2001/01/10", "2001/01/10", "2001/01/10", "2001/01/10",
"2001/01/10", "2001/01/10", "2001/01/10", "2001/01/10", "2001/01/10",
"2001/01/10", "2001/01/10", "2001/01/10", "2001/01/10", "2001/01/10",
"2001/01/10", "2001/01/10", "2001/01/10", "2001/01/10", "2001/01/10",
"2001/01/10", "2001/01/10", "2001/01/10", "2001/01/10", "2001/01/10",
"2001/01/10", "2001/01/10", "2001/01/10", "2001/01/10", "2001/01/10",
"2001/01/10", "2001/01/10", "2001/01/10", "2001/01/10", "2001/01/10",
"2001/01/10", "2001/01/10", "2001/01/10", "2001/01/10", "2001/01/10",
"2001/01/10", "2001/01/10", "2001/01/10", "2001/01/10", "2001/01/10",
"2001/01/10", "2001/01/10", "2001/01/10", "2001/01/10", "2001/01/10",
"2001/01/10", "2001/01/10", "2001/01/10", "2001/01/10", "2001/01/10",
"2001/01/10", "2001/01/10", "2001/01/10", "2001/01/10", "2001/01/10",
"2001/01/10", "2001/01/10", "2001/01/10", "2001/01/10"), LAST_MODIFIED = c("2001/03/24",
"2001/08/25", "2002/12/02", "2001/03/29", "2002/03/22", "2002/04/22",
"2001/01/22", "2001/11/15", "2001/04/10", "2001/03/24", "2001/04/03",
"2001/10/11", "2001/05/08", "2001/03/07", "2002/01/26", "2002/03/10",
"2001/03/24", "2001/03/25", "2001/01/28", "2001/09/06", "2001/05/22",
"2001/05/03", "2001/01/18", "2001/10/26", "2002/01/09", "2001/08/21",
"2001/02/09", "2001/03/14", "2002/03/22", "2001/03/19", "2001/02/10",
"2001/01/19", "2001/02/09", "2001/09/28", "2001/01/19", "2001/01/31",
"2001/03/19", "2001/01/31", "2001/02/09", "2001/03/07", "2001/08/10",
"2001/09/29", "2001/07/31", "2001/06/20", "2001/07/03", "2001/09/12",
"2001/03/30", "2002/05/07", "2002/08/10", "2002/02/23", "2001/09/06",
"2001/03/19", "2001/10/30", "2001/01/29", "2001/04/28", "2001/11/17",
"2002/02/23", "2001/03/15", "2001/10/28", "2001/01/31", "2001/06/12",
"2003/08/06", "2002/01/09", "2001/08/30", "2001/12/22", "2001/08/21",
"2001/04/16", "2001/11/15", "2002/05/03", "2001/03/15", "2001/08/29",
"2001/09/12", "2001/11/17", "2001/10/04", "2001/08/20", "2001/08/21",
"2001/11/17", "2003/08/06", "2001/04/03", "2001/07/22", "2001/02/11",
"2001/09/12", "2001/07/03", "2001/05/11", "2002/01/09", "2001/03/05",
"2001/07/10", "2003/06/25", "2001/02/18", "2001/03/27", "2001/06/06",
"2002/08/11", "2001/04/27", "2001/02/18", "2001/08/22", "2002/02/23",
"2001/10/30", "2001/07/03", "2001/06/04", "2003/04/28")), row.names = c(NA,
100L), class = "data.frame")
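By taking advantage of the superior performance of data.table, we can skip the conversion to and from an adjacency matrix entirely.
Given a SubsMAIN dataset like the one reproduced here: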
structure(list(MEMBER_ID = c(199781, 199781, 199781, 199781, 199781, 199781,
199781, 199781, 199781, 199781, 301950, 199781,
199781, 199781, 199781, 199781, 199781, 199781,
199781, 199781, 199781, 199781, 199781, 199781,
199781, 199781, 199781, 199781, 199781, 199781,
199781, 199781, 199781, 199781, 199781, 199781,
199781, 199781, 199781, 199781, 199781, 199781,
199781, 199781, 199781, 199781, 199781, 199781,
199781, 199781, 199781, 199781, 199781, 199781,
199781, 199781, 199781, 199781, 199781, 199781,
199781, 199781, 199781, 199781, 199781, 199781,
199781, 199781, 199781, 199781, 199781, 199781,
199781, 199781, 199781, 199781, 199781, 199781,
199781, 199781, 199781, 199781, 199781, 199781,
199781, 199781, 199781, 199781, 199781, 199781,
199781, 199781, 199781, 199781, 199781, 199781,
199781, 199781, 199781, 199781),
RATING = c(5, 5, 5, 3, 5, 5, 4, 5, 3, 4, 5, 5, 5, 3, 4, 4, 2, 5,
5, 5, 4, 5, 5, 5, 5, 4, 5, 3, 5, 4, 5, 4, 4, 3, 3, 2,
5, 3, 5, 4, 5, 5, 5, 5, 5, 4, 5, 5, 5, 4, 5, 5, 4, 4,
5, 5, 5, 3, 4, 4, 5, 5, 5, 5, 4, 5, 5, 5, 4, 5, 5, 5,
5, 5, 5, 5, 4, 4, 5, 5, 4, 4, 5, 5, 4, 5, 3, 5, 3, 5,
5, 5, 2, 3, 5, 5, 3, 5, 4, 3),
AUTHOR_ID = c(258195, 201494, 409591, 1964674948, 284187, 641414,
686042, 531975, 1892323204, 362579, 199781,
2988937092, 205270, 353623, 657993, 2418118532,
590804, 222936, 216022, 2320404356, 199862, 538993,
290046, 234885, 417532, 1705021316, 216430,
1320783748, 301950, 2012450692, 3267006340, 321415,
213839, 1967230852, 519301, 1880919940, 409850,
617204, 262004, 200165, 3267006340, 345500,
1711443844, 290046, 238184, 241451, 452301, 301950,
205491, 212098, 241578, 2367524740, 2366410628,
225252, 2988937092, 1789300612, 1965068164, 432146,
2151190404, 1772130180, 290046, 203622, 210929,
243427, 205705, 301950, 2551549828, 2250674052,
1378848644, 298157, 1873186692, 526355, 231243,
2988937092, 241578, 547653, 1301319556, 1956417412,
292382, 2571341700, 421709, 2309066628, 256232,
214201, 447962, 278848, 2533396356, 328874,
1955106692, 262822, 1568706436, 458913, 217003,
583640, 307259, 199780, 1836027780, 235786,
2366279556, 358714),
STATUS = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L),
CREATION = c("2001/01/10", "2001/01/10", "2001/01/10",
"2001/01/10", "2001/01/10", "2001/01/10",
"2001/01/10", "2001/01/10", "2001/01/10",
"2001/01/10", "2001/01/10", "2001/01/10",
"2001/01/10", "2001/01/10", "2001/01/10",
"2001/01/10", "2001/01/10", "2001/01/10",
"2001/01/10", "2001/01/10", "2001/01/10",
"2001/01/10", "2001/01/10", "2001/01/10",
"2001/01/10", "2001/01/10", "2001/01/10",
"2001/01/10", "2001/01/10", "2001/01/10",
"2001/01/10", "2001/01/10", "2001/01/10",
"2001/01/10", "2001/01/10", "2001/01/10",
"2001/01/10", "2001/01/10", "2001/01/10",
"2001/01/10", "2001/01/10", "2001/01/10",
"2001/01/10", "2001/01/10", "2001/01/10",
"2001/01/10", "2001/01/10", "2001/01/10",
"2001/01/10", "2001/01/10", "2001/01/10",
"2001/01/10", "2001/01/10", "2001/01/10",
"2001/01/10", "2001/01/10", "2001/01/10",
"2001/01/10", "2001/01/10", "2001/01/10",
"2001/01/10", "2001/01/10", "2001/01/10",
"2001/01/10", "2001/01/10", "2001/01/10",
"2001/01/10", "2001/01/10", "2001/01/10",
"2001/01/10", "2001/01/10", "2001/01/10",
"2001/01/10", "2001/01/10", "2001/01/10",
"2001/01/10", "2001/01/10", "2001/01/10",
"2001/01/10", "2001/01/10", "2001/01/10",
"2001/01/10", "2001/01/10", "2001/01/10",
"2001/01/10", "2001/01/10", "2001/01/10",
"2001/01/10", "2001/01/10", "2001/01/10",
"2001/01/10", "2001/01/10", "2001/01/10",
"2001/01/10", "2001/01/10", "2001/01/10",
"2001/01/10", "2001/01/10", "2001/01/10",
"2001/01/10"),
LAST_MODIFIED = c("2001/03/24", "2001/08/25", "2002/12/02",
"2001/03/29", "2002/03/22", "2002/04/22",
"2001/01/22", "2001/11/15", "2001/04/10",
"2001/03/24", "2001/04/03", "2001/10/11",
"2001/05/08", "2001/03/07", "2002/01/26",
"2002/03/10", "2001/03/24", "2001/03/25",
"2001/01/28", "2001/09/06", "2001/05/22",
"2001/05/03", "2001/01/18", "2001/10/26",
"2002/01/09", "2001/08/21", "2001/02/09",
"2001/03/14", "2002/03/22", "2001/03/19",
"2001/02/10", "2001/01/19", "2001/02/09",
"2001/09/28", "2001/01/19", "2001/01/31",
"2001/03/19", "2001/01/31", "2001/02/09",
"2001/03/07", "2001/08/10", "2001/09/29",
"2001/07/31", "2001/06/20", "2001/07/03",
"2001/09/12", "2001/03/30", "2002/05/07",
"2002/08/10", "2002/02/23", "2001/09/06",
"2001/03/19", "2001/10/30", "2001/01/29",
"2001/04/28", "2001/11/17", "2002/02/23",
"2001/03/15", "2001/10/28", "2001/01/31",
"2001/06/12", "2003/08/06", "2002/01/09",
"2001/08/30", "2001/12/22", "2001/08/21",
"2001/04/16", "2001/11/15", "2002/05/03",
"2001/03/15", "2001/08/29", "2001/09/12",
"2001/11/17", "2001/10/04", "2001/08/20",
"2001/08/21", "2001/11/17", "2003/08/06",
"2001/04/03", "2001/07/22", "2001/02/11",
"2001/09/12", "2001/07/03", "2001/05/11",
"2002/01/09", "2001/03/05", "2001/07/10",
"2003/06/25", "2001/02/18", "2001/03/27",
"2001/06/06", "2002/08/11", "2001/04/27",
"2001/02/18", "2001/08/22", "2002/02/23",
"2001/10/30", "2001/07/03", "2001/06/04",
"2003/04/28")),
row.names = c(NA, 100L),
class = "data.frame")
the following data.table solution:
library(data.table)
# ...
# Code to generate your dataset 'SubsMAIN'.
# ...
# Set your cutoff for the minimum number of transactions.
x <- 3
# Filter 'SubsMAIN' to only those transactions for pairings that meet the cutoff.
results <- as.data.table(SubsMAIN)[
  # Mark each transaction with a new ID for its pairing of 'MEMBER_ID' with
  # 'AUTHOR_ID'.
  , Pair_ID := .GRP,
  # To make the relationship symmetric, pair by the MAX and MIN of the two
  # original IDs, rather than by their column order.
  by = .(pmax(MEMBER_ID, AUTHOR_ID), pmin(MEMBER_ID, AUTHOR_ID))][
    # Mark each transaction with the tally of all transactions for its pair.
    , Tally := .N, by = Pair_ID][
      # Include only those transactions whose tallies meet the cutoff.
      Tally >= x,
      # Drop the helper 'Tally' column; 'Pair_ID' stays alongside the
      # original 'SubsMAIN' columns.
      -c("Tally")]
# View results.
results
should yield results like this:
    MEMBER_ID RATING  AUTHOR_ID STATUS   CREATION LAST_MODIFIED Pair_ID
 1:    301950      5     199781      0 2001/01/10    2001/04/03      11
 2:    199781      5 2988937092      1 2001/01/10    2001/10/11      12
 3:    199781      5     290046      0 2001/01/10    2001/01/18      23
 4:    199781      5     301950      0 2001/01/10    2002/03/22      11
 5:    199781      5     290046      0 2001/01/10    2001/06/20      23
 6:    199781      5     301950      0 2001/01/10    2002/05/07      11
 7:    199781      5 2988937092      1 2001/01/10    2001/04/28      12
 8:    199781      5     290046      0 2001/01/10    2001/06/12      23
 9:    199781      5     301950      0 2001/01/10    2001/08/21      11
10:    199781      5 2988937092      0 2001/01/10    2001/10/04      12
where every transaction from SubsMAIN is kept, as long as it belongs to a pairing (Pair_ID) of MEMBER_ID and AUTHOR_ID that has at least x transactions.
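If what you ultimately want is the list of "partners" per ID rather than the filtered transactions, something along these lines might work. Treat it as a rough sketch: `toy`, `cutoff`, `ID_A`, `ID_B`, `long`, and `partners` are names made up for illustration, and the values in `toy` are invented rather than taken from SubsMAIN.

```r
library(data.table)

# Hypothetical toy data: column names follow 'SubsMAIN', values are invented.
toy <- data.table(
  MEMBER_ID = c(1, 2, 1, 1, 3, 1),
  AUTHOR_ID = c(2, 1, 2, 3, 1, 4)
)

# Assumed cutoff for this toy example.
cutoff <- 3

# Count transactions per unordered pair of IDs, then keep qualifying pairs.
pairs <- toy[, .(N = .N),
             by = .(ID_A = pmin(MEMBER_ID, AUTHOR_ID),
                    ID_B = pmax(MEMBER_ID, AUTHOR_ID))][N >= cutoff]

# List each qualifying pair in both directions, so that every ID appears
# on the left-hand side with each of its partners.
long <- unique(rbind(pairs[, .(ID = ID_A, PARTNER = ID_B)],
                     pairs[, .(ID = ID_B, PARTNER = ID_A)]))

# Collect the partners of each ID into a named list.
partners <- split(long$PARTNER, long$ID)
```

Here `pairs` holds one row per qualifying pair together with its count, and `partners` is a plain named list keyed by ID, which you could print or convert further as needed.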
Note
For reference, here are the counts in the Tally column:
    MEMBER_ID RATING  AUTHOR_ID STATUS   CREATION LAST_MODIFIED Pair_ID # Tally
 1:    301950      5     199781      0 2001/01/10    2001/04/03      11 #     4
 2:    199781      5 2988937092      1 2001/01/10    2001/10/11      12 #     3
 3:    199781      5     290046      0 2001/01/10    2001/01/18      23 #     3
 4:    199781      5     301950      0 2001/01/10    2002/03/22      11 #     4
 5:    199781      5     290046      0 2001/01/10    2001/06/20      23 #     3
 6:    199781      5     301950      0 2001/01/10    2002/05/07      11 #     4
 7:    199781      5 2988937092      1 2001/01/10    2001/04/28      12 #     3
 8:    199781      5     290046      0 2001/01/10    2001/06/12      23 #     3
 9:    199781      5     301950      0 2001/01/10    2001/08/21      11 #     4
10:    199781      5 2988937092      0 2001/01/10    2001/10/04      12 #     3
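Observe how row 1, with a MEMBER_ID of 301950 and an AUTHOR_ID of 199781, is paired with rows 4, 6, and 9, each of which has the reverse: a MEMBER_ID of 199781 and an AUTHOR_ID of 301950. That is, our Pair_ID (here 11) preserves the symmetry that was requested.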
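To see why grouping by the minimum and maximum of the two IDs gives a symmetric key, here is a tiny illustration. The vectors `member` and `author`, and the names `lo` and `hi`, are hypothetical and exist only for this sketch.

```r
# The same two users, once in each column order (IDs borrowed from the sample).
member <- c(301950, 199781)
author <- c(199781, 301950)

# Element-wise minimum and maximum ignore which column an ID came from.
lo <- pmin(member, author)
hi <- pmax(member, author)

# Both rows reduce to the same (lo, hi) combination, so they would fall
# into the same group when used in 'by'.
same_key <- lo[1] == lo[2] && hi[1] == hi[2]
```

Because `pmin()` and `pmax()` work element-wise, the order of MEMBER_ID and AUTHOR_ID within a row no longer matters for the grouping.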
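Now, because the sample SubsMAIN contains no pairing with 5 (or more) transactions, I have lowered the cutoff to x <- 3. That way, at least some transactions make the cut, and there is some output to display.
For your full dataset, feel free to change the cutoff to x <- 5 or any other value you like.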
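If you would rather stay with a matrix, the thresholding you describe can also be done directly. The following is only a sketch under assumptions: `mm` here is a small invented symmetric count matrix with IDs as dimnames (not the one from your code), and `ids`, `cutoff`, `hits`, and `edges` are illustrative names.

```r
# Hypothetical symmetric count matrix; IDs and counts are invented.
ids <- c("a", "b", "c")
mm <- matrix(c(0, 6, 2,
               6, 0, 5,
               2, 5, 0),
             nrow = 3, dimnames = list(ids, ids))

# Assumed cutoff for the minimum count.
cutoff <- 5

# Blank the lower triangle and the diagonal so each pair is reported once.
mm[lower.tri(mm, diag = TRUE)] <- NA

# Row and column positions of the remaining cells that meet the cutoff.
# Comparisons with NA yield NA, which which() skips.
hits <- which(mm >= cutoff, arr.ind = TRUE)

# Turn the positions into a list of pairs with their counts.
edges <- data.frame(
  ID      = rownames(mm)[hits[, "row"]],
  PARTNER = colnames(mm)[hits[, "col"]],
  N       = mm[hits]
)
```

This avoids overwriting small values with 0, since the cells below the cutoff are simply never selected. For a very large matrix, though, the data.table route above is likely to be lighter on memory.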