R - Speeding up a search with apply
Suppose I have a data.table in which each observation is a pair of products that my customers bought together.
For each pair of products in my data.table (one row of dt), I want to know whether there is a third common product that is sometimes also bought together with either of the two.
I want to include these "common products" as a new column in dt.
Currently I do it as shown below, but my real data has millions of rows; computing just one week of data already takes 20 hours.
How can I speed this up? Is the apply function a smart choice here, or should I think about some form of mapping instead?
Mock example:
library(data.table)
library(stringi)
library(future.apply)
set.seed(1)
# build mock data
dt <- data.table(V1 = stri_rand_strings(100, 1),
V2 = stri_rand_strings(100, 1))
head(dt,17)
# V1 V2
# 1: G e
# 2: N L
# 3: Z G
# 4: u z
# 5: C d
# 6: t D
# 7: w 8
# 8: e T
# 9: d v
#10: 3 b
#11: C y
#12: A j
#13: g M
#14: N Q
#15: l 9
#16: U 0
#17: i i
# function to find common products
find_products <- function(a, b){
library(data.table)
toString(unique((dt[.(c(a, b)), on=.(V1), V2[duplicated(V2)]])))
}
# initiate parallel processing
plan(multisession) # on Windows machine - use plan(multicore) on Linux
# apply function across rows
common_products <- future_apply(dt, 1, function(y) find_products(y['V1'], y['V2']))
dt_final <- cbind(dt, common_products)
head(dt_final, 17)
# V1 V2 common_products
# 1: G e
# 2: N L
# 3: Z G
# 4: u z
# 5: C d
# 6: t D
# 7: w 8
# 8: e T
# 9: d v
#10: 3 b
#11: C y
#12: A j
#13: g M
#14: N Q
#15: l 9
#16: U 0
#17: i i i, z, B, l
Update 3 (since ego crashes on large graphs)
The ego approach in Update 2 seems to crash because the network is too large, so we can probably try it row by row
library(igraph)
g <- simplify(graph_from_data_frame(dt, directed = FALSE))
nms <- names(V(g))
dt[
,
common := toString(nms[do.call(intersect, ego(g, 1, unlist(.(V1, V2)), mindist = 1))]),
.(id = seq_along(V1))
]
Update 2 (working around crashes when using igraph)
Given the crashes that triangles or subgraph_isomorphisms can trigger on large igraph objects, we can use the code below instead (it may sacrifice some speed, but it should be a stable workaround)
g <- simplify(graph_from_data_frame(dt, directed = FALSE))
nms <- names(V(g))
transform(
dt,
common = Map(
function(...) nms[intersect(...)],
ego(g, 1, V1, mindist = 1),
ego(g, 1, V2, mindist = 1)
)
)
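If you prefer a comma-separated character column rather than a list-column, a small variant of the same idea should work (a sketch reusing the ego() pattern above):

dt[, common := mapply(
  function(e1, e2) toString(nms[intersect(e1, e2)]),
  ego(g, 1, V1, mindist = 1),
  ego(g, 1, V2, mindist = 1)
)]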
Update
I guess you can try subgraph_isomorphisms to find all triangles in all permutations and merge them into dt
g <- simplify(graph_from_data_frame(dt, directed = FALSE))
dtg <- as.data.table(do.call(rbind, Map(names, subgraph_isomorphisms(make_ring(3), g))))
dtg[
dt,
on = .(V1 = V1, V2 = V2)
][
,
.(common = toString(na.omit(V3))), .(V1, V2)
][
V1 == V2,
common := sapply(V1, function(x) toString(names(neighbors(g, x))))
][]
which should be fast and gives you
V1 V2 common
1: G e
2: N L
3: Z G
4: u z
5: C d B
6: t D
7: w 8
8: e T
9: d v
10: 3 b
11: C y
12: A j
13: g M
14: N Q
15: l 9
16: U 0
17: i i l, z, 7, B, 5, E
18: z 6
19: N R S
20: m d
21: v z
22: D U
23: e U
24: 7 A
25: G k
26: N S R
27: 0 V
28: N C
29: r E
30: L a
31: T Z
32: b 4
33: U 2
34: B d C, 6
35: p v
36: f b
37: n Y
38: 6 W j
39: i z
40: P V
41: o g
42: e b
43: m E
44: Y G
45: W j 6
46: m S
47: 1 A
48: T k
49: j 6 W
50: g r
51: T c
52: r Y
53: R K
54: F S
55: 4 V
56: 6 B d
57: J W
58: W 4
59: f H
60: P D
61: u H
62: I t
63: S R N
64: K m
65: e s
66: F P
67: T 3
68: l K
69: 5 i
70: s K O
71: L d r
72: q q P
73: L r d
74: K O s
75: T N
76: t t D, I, x
77: r d L
78: O j
79: m b
80: x t
81: Q I
82: i B
83: O s K
84: K V
85: k s
86: C B d
87: i l
88: 7 i
89: F w
90: 8 X
91: E i
92: 3 O
93: d 6 B
94: s v
95: m H
96: n a
97: S 6
98: P q
99: o J
100: b m
V1 V2 common
Previous answer
Perhaps this could help
library(igraph)
g <- simplify(graph_from_data_frame(dt, directed = FALSE))
dcast(
melt(dt[, id := 1:.N], "id")[
,
common := toString(names(V(g))[do.call(intersect, ego(g, nodes = value, mindist = 1))]),
id
],
id + common ~ variable
)[, .(V1, V2, common)]
which gives
V1 V2 common
1: G e
2: N L
3: Z G
4: u z
5: C d B
6: t D
7: w 8
8: e T
9: d v
10: 3 b
11: C y
12: A j
13: g M
14: N Q
15: l 9
16: U 0
17: i i l, z, 7, B, 5, E
18: z 6
19: N R S
20: m d
21: v z
22: D U
23: e U
24: 7 A
25: G k
26: N S R
27: 0 V
28: N C
29: r E
30: L a
31: T Z
32: b 4
33: U 2
34: B d C, 6
35: p v
36: f b
37: n Y
38: 6 W j
39: i z
40: P V
41: o g
42: e b
43: m E
44: Y G
45: W j 6
46: m S
47: 1 A
48: T k
49: j 6 W
50: g r
51: T c
52: r Y
53: R K
54: F S
55: 4 V
56: 6 B d
57: J W
58: W 4
59: f H
60: P D
61: u H
62: I t
63: S R N
64: K m
65: e s
66: F P
67: T 3
68: l K
69: 5 i
70: s K O
71: L d r
72: q q P
73: L r d
74: K O s
75: T N
76: t t D, I, x
77: r d L
78: O j
79: m b
80: x t
81: Q I
82: i B
83: O s K
84: K V
85: k s
86: C B d
87: i l
88: 7 i
89: F w
90: 8 X
91: E i
92: 3 O
93: d 6 B
94: s v
95: m H
96: n a
97: S 6
98: P q
99: o J
100: b m
V1 V2 common
The three-way pairings can be treated as triangles in an undirected graph. The igraph package can find these efficiently. I've included an example with 20M pairs of 3-character product codes; it runs in about 70 seconds on a single thread.
library(data.table)
library(stringi)
library(igraph)
getpaired <- function(g, id) names(unique(neighbors(g, id)))
commonProducts <- function(dt) {
blnSort <- dt$V1 > dt$V2
dt[blnSort, c("V2", "V1") := list(V1, V2)] # sort each row
# get triangles
g <- graph_from_data_frame(dt, FALSE)
m <- matrix(V(g)$name[triangles(g)], ncol = 3, byrow = TRUE)
# sort each row
m <- matrix(m[order(row(m), m, method = "radix")], ncol = 3, byrow = TRUE)
dt3 <- as.data.table(m)
# map common products back to the original dataframe
dt3 <- rbindlist(
list(
# the three ordered pairs in each triangle
dt3,
dt3[, c(1, 3, 2)],
dt3[, c(2, 3, 1)],
# common products in "two-sided" triangles
dt[V1 == V2][
, .(V2 = V2, V3 = .(getpaired(g, V1))), by = "V1"
][
, .(V1 = rep(rep.int(V1, lengths(V3)), 2),
V2 = c(rep.int(V1, lengths(V3)), unlist(V3)),
V3 = c(unlist(V3), rep.int(V1, lengths(V3))))
][ # sort (V1, V2) in each row
V1 > V2, c("V2", "V1") := list(V1, V2)
]
),
FALSE # bind by index
)[ # collapse common products into a single vector for each pair
, .(V3 = .(V3)),
by = c("V1", "V2")
][ # join into the original (row-sorted) data.table
dt, on = c("V1", "V2")
][ # unsort V1, V2 in each row to match the original (unsorted) data.table
, c("V1", "V2") := dt[blnSort, c("V2", "V1") := list(V1, V2)]
]
}
set.seed(1)
# build mock data
dt <- data.table(V1 = stri_rand_strings(100, 1),
V2 = stri_rand_strings(100, 1))
dt3 <- commonProducts(dt)
print(dt3)
#> V1 V2 V3
#> 1: G e
#> 2: N L
#> 3: Z G
#> 4: u z
#> 5: C d B
#> 6: t D t
#> 7: w 8
#> 8: e T
#> 9: d v
#> 10: 3 b
#> 11: C y
#> 12: A j
#> 13: g M
#> 14: N Q
#> 15: l 9
#> 16: U 0
#> 17: i i i,7,E,B,5,z,...
#> 18: z 6
#> 19: N R S
#> 20: m d
#> 21: v z
#> 22: D U
#> 23: e U
#> 24: 7 A
#> 25: G k
#> 26: N S R
#> 27: 0 V
#> 28: N C
#> 29: r E
#> 30: L a
#> 31: T Z
#> 32: b 4
#> 33: U 2
#> 34: B d C,6
#> 35: p v
#> 36: f b
#> 37: n Y
#> 38: 6 W j
#> 39: i z i
#> 40: P V
#> 41: o g
#> 42: e b
#> 43: m E
#> 44: Y G
#> 45: W j
#> 46: m S
#> 47: 1 A
#> 48: T k
#> 49: j 6 W
#> 50: g r
#> 51: T c
#> 52: r Y
#> 53: R K
#> 54: F S
#> 55: 4 V
#> 56: 6 B d
#> 57: J W
#> 58: W 4
#> 59: f H
#> 60: P D
#> 61: u H
#> 62: I t t
#> 63: S R N
#> 64: K m
#> 65: e s
#> 66: F P
#> 67: T 3
#> 68: l K
#> 69: 5 i i
#> 70: s K O
#> 71: L d
#> 72: q q P,q,q
#> 73: L r d
#> 74: K O s
#> 75: T N
#> 76: t t D,I,t,x,t
#> 77: r d L
#> 78: O j
#> 79: m b
#> 80: x t t
#> 81: Q I
#> 82: i B i
#> 83: O s K
#> 84: K V
#> 85: k s
#> 86: C B d
#> 87: i l i
#> 88: 7 i i
#> 89: F w
#> 90: 8 X
#> 91: E i i
#> 92: 3 O
#> 93: d 6 B
#> 94: s v
#> 95: m H
#> 96: n a
#> 97: S 6
#> 98: P q q
#> 99: o J
#> 100: b m
#> V1 V2 V3
# timing a much larger dataset
dt <- data.table(V1 = stri_rand_strings(2e7, 3),
V2 = stri_rand_strings(2e7, 3))
system.time(dt3 <- commonProducts(dt))
#> user system elapsed
#> 72.75 3.05 71.88
dt3[lengths(V3) != 0L] # show only those pairs with common products
#> V1 V2 V3
#> 1: GBW mDN lxF
#> 2: ix6 jpR 0VI
#> 3: xLG VeE aik
#> 4: A36 RzJ YYu
#> 5: zAo OYu zAo
#> ---
#> 1841567: qX9 xrW 7lb
#> 1841568: knO 2G6 knO
#> 1841569: rsU 5Rw ER8
#> 1841570: Bts 3L1 1bQ
#> 1841571: c0h pgd jxJ
This handles the "two-sided triangles" created when V1 == V2 (as in row 17 of the OP's example data). For example, if the whole dataset consisted of the pairs (t, i) and (i, i), then i would be the common product of (t, i) (i is paired with both t and i), and i, t would be the common products of (i, i) (i and t are each paired with i).
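As a tiny sanity check of that description (a sketch; as in the full output above, entries in V3 can repeat and their order may differ):

dt_tiny <- data.table(V1 = c("t", "i"), V2 = c("i", "i"))
commonProducts(dt_tiny)
# expected per the explanation: "i" as the common product for (t, i),
# and "i" and "t" (possibly with repeats) for (i, i)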
I believe you can stick with data.table for what you want to do. As commenters have already pointed out, you need to decide whether the product pairing [a,b] is equivalent to [b,a] (in your example answer it only holds for the pairing [a,b]). Either way, the bottleneck of this answer is the Map() call; you could use future_Map() for more speed, but you would have to test on your actual data to see whether it's needed.
I also want to flag that I keep the common-products column as a list-column, although you may want it in a different format. Right now it is a mix of NULL and empty character entries when there are no matches, so if you keep it as a list-column you may want to clean that up -- that's up to you.
Solution:
dt_unique = unique(dt[, .(V1, V2)])
dt_pairs = dt_unique[, list(ref_list = list(unique(V2))), .(product = V1)]
dt_unique = dt_pairs[dt_unique, on = c("product" = "V2")]
setnames(dt_unique, c("V2", "V2_ref", "V1"))
dt_unique = dt_pairs[dt_unique, on = c("product" = "V1")]
setnames(dt_unique, c("V1", "V1_ref", "V2", "V2_ref"))
dt_unique[, common_prods := Map(function(x, y) unique.default(y[chmatch(x, y, 0L)]), V1_ref, V2_ref)]
dt_unique[, c("V1_ref", "V2_ref") := NULL]
dt_unique[dt, on = c("V1", "V2")]
V1 V2 common_prods correct_common_prods
1: G e
2: N L
3: Z G
4: u z
5: C d
6: t D
7: w 8
8: e T
9: d v
10: 3 b
11: C y
12: A j
13: g M
14: N Q
15: l 9
16: U 0
17: i i i,z,B,l i, z, B, l
18: z 6
19: N R
20: m d
21: v z
22: D U
23: e U
24: 7 A
25: G k
26: N S R R
27: 0 V
28: N C
29: r E
30: L a
31: T Z
32: b 4
33: U 2
34: B d
35: p v
36: f b
37: n Y
38: 6 W
39: i z
40: P V
41: o g
42: e b
43: m E
44: Y G
45: W j
46: m S
47: 1 A
48: T k
49: j 6
50: g r
51: T c
52: r Y
53: R K
54: F S
55: 4 V
56: 6 B
57: J W
58: W 4
59: f H
60: P D
61: u H
62: I t t t
63: S R
64: K m
65: e s
66: F P
67: T 3
68: l K
69: 5 i i i
70: s K
71: L d
72: q q q q
73: L r d d
74: K O
75: T N
76: t t D,t D, t
77: r d
78: O j
79: m b
80: x t t t
81: Q I
82: i B
83: O s
84: K V
85: k s
86: C B d d
87: i l
88: 7 i i i
89: F w
90: 8 X
91: E i i i
92: 3 O
93: d 6
94: s v
95: m H
96: n a
97: S 6
98: P q q q
99: o J
100: b m
V1 V2 common_prods correct_common_prods
Reproducible code (with comments):
library(data.table)
n = 1e2
set.seed(1)
dt <- data.table(V1 = stringi::stri_rand_strings(n, 1),
V2 = stringi::stri_rand_strings(n, 1))
#Matching your output:
find_products <- function(a, b){
library(data.table)
toString(unique((dt[.(c(a, b)), on=.(V1), V2[duplicated(V2)]])))
}
dt[, correct_common_prods := apply(dt, 1, function(y) find_products(y[['V1']], y[['V2']]))]
# If (a, b) and (b, a) are equivalent, you'll want this instead:
# dt_unique = unique(rbindlist(list(dt[, .(V1, V2)], dt[, .(V2, V1)]), use.names = FALSE))
dt_unique = unique(dt[, .(V1, V2)])
# Creating list-column w/ corresponding products
dt_pairs = dt_unique[, list(ref_list = list(unique(V2))), .(product = V1)]
# Merging and re-naming. There may be a more data.table way to
# handle the renaming because this feels inelegant
dt_unique = dt_pairs[dt_unique, on = c("product" = "V2")]
setnames(dt_unique, c("V2", "V2_ref", "V1"))
dt_unique = dt_pairs[dt_unique, on = c("product" = "V1")]
setnames(dt_unique, c("V1", "V1_ref", "V2", "V2_ref"))
# This is the memory-intensive part because it checks for the intersection on
# each row. This creates a list-column `common_prods`
# OR, easier to read but slower:
# dt_unique[, common_prods := Map(intersect, V1_ref, V2_ref)]
dt_unique[, common_prods := Map(function(x, y) unique.default(y[chmatch(x, y, 0L)]), V1_ref, V2_ref)]
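# Or, to parallelize this bottleneck as mentioned above (an untested sketch
# using future.apply; whether it pays off depends on your real data):
# library(future.apply); plan(multisession)
# dt_unique[, common_prods := future_Map(
#   function(x, y) unique.default(y[chmatch(x, y, 0L)]), V1_ref, V2_ref)]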
# Column cleanup (retain _ref columns to better understand how this works)
# then merging in the common products
dt_unique[, c("V1_ref", "V2_ref") := NULL]
dt = dt_unique[dt, on = c("V1", "V2")]
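If you do want to flatten the list-column into a plain character column afterwards, one option is shown below (a sketch; toString() maps both NULL and empty entries to ""):

dt[, common_prods := vapply(common_prods, toString, character(1))]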
This problem looks like market basket analysis, so I suggest treating it as such.
The problem with the random sample data is that, since your data is random, you most likely won't find any strong associations between products ;-). But perhaps (hopefully) that won't be the case for your production dataset(s).
There is a great tutorial on basket analysis on DataCamp, which I (mostly) followed for this answer. For more in-depth information, make sure to work through the tutorial yourself. The link is in the comments in the code below.
# Workflow adapted from
# https://www.datacamp.com/community/tutorials/market-basket-analysis-r
library(arules)
library(arulesViz)
library(tidyverse)
# Put items bought together into a single column, separate with comma
# then convert into transactions
write.csv(setDF(dt) %>% unite("items", everything(), sep = ","),
"./temp/my_transactions.csv",
quote = FALSE, row.names = FALSE)
# Now read the csv into a transactions file
tr <- read.transactions("./temp/my_transactions.csv",
format = "basket", sep = ",")
# Frequency plot (for getting insight only, so commented out)
# itemFrequencyPlot(tr, topN = 15, type = "absolute", main="Absolute Item Frequency Plot")
# itemFrequencyPlot(tr, topN = 15, type = "relative", main="Absolute Item Frequency Plot")
# Mine rules.
# Since your sample data is rather small and random,
# there will not be many items bought together frequently,
# so I set the confidence to 0.5 (i.e. 50%).
# If your data set grows, you should/can increase the confidence
# to get more reliable pairings
association.rules <- apriori(tr, parameter = list(supp = 0.001, conf = 0.5, maxlen = 100))
# what have we found here?
inspect(association.rules)
# lhs rhs support confidence coverage lift count
# [1] {x} => {t} 0.00990099 1.0 0.00990099 25.250000 1
# [2] {5} => {i} 0.00990099 1.0 0.00990099 14.428571 1
# [3] {c} => {T} 0.00990099 1.0 0.00990099 16.833333 1
# [4] {1} => {A} 0.00990099 1.0 0.00990099 33.666667 1
# [5] {p} => {v} 0.00990099 1.0 0.00990099 25.250000 1
# [6] {2} => {U} 0.00990099 1.0 0.00990099 25.250000 1
# [7] {9} => {l} 0.00990099 1.0 0.00990099 33.666667 1
# [8] {M} => {g} 0.00990099 1.0 0.00990099 33.666667 1
# [9] {y} => {C} 0.00990099 1.0 0.00990099 25.250000 1
# [10] {X} => {8} 0.00990099 1.0 0.00990099 50.500000 1
# [11] {8} => {X} 0.00990099 0.5 0.01980198 50.500000 1
# [12] {q} => {P} 0.00990099 0.5 0.01980198 12.625000 1
# [13] {a} => {n} 0.00990099 0.5 0.01980198 25.250000 1
# [14] {n} => {a} 0.00990099 0.5 0.01980198 25.250000 1
# [15] {a} => {L} 0.00990099 0.5 0.01980198 12.625000 1
# [16] {0} => {U} 0.00990099 0.5 0.01980198 12.625000 1
# [17] {0} => {V} 0.00990099 0.5 0.01980198 12.625000 1
# [18] {u} => {H} 0.00990099 0.5 0.01980198 16.833333 1
# [19] {u} => {z} 0.00990099 0.5 0.01980198 12.625000 1
# [20] {Q} => {I} 0.00990099 0.5 0.01980198 25.250000 1
# [21] {I} => {Q} 0.00990099 0.5 0.01980198 25.250000 1
# [22] {Q} => {N} 0.00990099 0.5 0.01980198 8.416667 1
# [23] {Z} => {G} 0.00990099 0.5 0.01980198 12.625000 1
# [24] {Z} => {T} 0.00990099 0.5 0.01980198 8.416667 1
# [25] {f} => {H} 0.00990099 0.5 0.01980198 16.833333 1
# [26] {f} => {b} 0.00990099 0.5 0.01980198 8.416667 1
# [27] {n} => {Y} 0.00990099 0.5 0.01980198 16.833333 1
# [28] {o} => {J} 0.00990099 0.5 0.01980198 25.250000 1
# [29] {J} => {o} 0.00990099 0.5 0.01980198 25.250000 1
# [30] {o} => {g} 0.00990099 0.5 0.01980198 16.833333 1
# [31] {J} => {W} 0.00990099 0.5 0.01980198 12.625000 1
# [32] {I} => {t} 0.00990099 0.5 0.01980198 12.625000 1
# [33] {w} => {8} 0.00990099 0.5 0.01980198 25.250000 1
# [34] {8} => {w} 0.00990099 0.5 0.01980198 25.250000 1
# [35] {w} => {F} 0.00990099 0.5 0.01980198 16.833333 1
# [36] {7} => {A} 0.00990099 0.5 0.01980198 16.833333 1
# [37] {7} => {i} 0.00990099 0.5 0.01980198 7.214286 1
# what does the above mean?
# [1] 100% of people that have bought x, also bought t
# [11] 50% of people that have bought 8, also bought X
# visualise
plot(association.rules, method = "graph", engine = "htmlwidget")
plot(association.rules, method="paracoord")
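If you want the mined rules as a plain data frame for further filtering or joining back onto your pairs, the classic arules conversion should work (a sketch):

rules_df <- as(association.rules, "data.frame")
head(rules_df[order(-rules_df$confidence), ])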
Apologies if I've misunderstood the question (it does reproduce the example, fwiw), but it seems like this might be more efficient as a simple join. For n = 10M this runs in roughly 3 seconds.
My data.table is rusty, so I'm using dtplyr to translate my dplyr syntax into data.table syntax. I imagine there are data.table optimizations that could make this faster, or you could use the collapse package.
Setup
library(data.table)
library(dplyr)
library(dtplyr)
library(stringi)
n = 10000000
set.seed(1)
dt <- data.table(V1 = stri_rand_strings(n, 2), # 3,844 groups
V2 = stri_rand_strings(n, 2))
Join each V2 onto its associated V1
dt_single <- distinct(dt, V1)
dt_single %>%
left_join(dt, by = c("V1" = "V1")) %>%
group_by(V1) %>%
summarize(common_products = paste(V2, sep = ", ", collapse = ", "))
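For comparison, the same join-and-collapse in plain data.table might look like this (a sketch, not benchmarked; it groups dt directly, which is equivalent here because dt_single is just the distinct V1 values):

common <- dt[, .(common_products = paste(V2, collapse = ", ")), by = V1]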
To get output in the same format as the OP's, we can assign the result above to common (via -> common) and then run:
dt %>%
left_join(common) %>%
mutate(common_products = if_else(V1 == V2, common_products, "")) %>%
select(V1, V2, common_products)
which gives the output the OP is looking for, as far as I can tell:
V1 V2 common_products
<chr> <chr> <chr>
1 G e
2 N L
3 Z G
4 u z
5 C d
6 t D
7 w 8
8 e T
9 d v
10 3 b
11 C y
12 A j
13 g M
14 N Q
15 l 9
16 U 0
17 i i i, z, B, l
...