R - Speeding up a search with apply

Suppose I have a data.table where the observations are pairs of products that my consumers bought together.

For every pair of products in my data.table (a row in dt), I want to know whether there is a third product they have in common, i.e. one that is sometimes also bought together with either of the two products.

I want to include these "common products" as a new column in dt.

Currently I do it as shown below, but my real data has millions of rows, and computing a single week of data takes about 20 hours.

How can I speed this up? Is the apply function a smart way to do this, or should I be thinking about some kind of mapping/join instead?

Mock example:

library(data.table)
library(stringi)
library(future.apply)

set.seed(1)

# build mock data
dt <- data.table(V1 = stri_rand_strings(100, 1),
                 V2 = stri_rand_strings(100, 1))

head(dt,17)
#    V1 V2
# 1:  G  e
# 2:  N  L
# 3:  Z  G
# 4:  u  z
# 5:  C  d
# 6:  t  D
# 7:  w  8
# 8:  e  T
# 9:  d  v
#10:  3  b
#11:  C  y
#12:  A  j
#13:  g  M
#14:  N  Q
#15:  l  9
#16:  U  0
#17:  i  i

#function to find common products
find_products <- function(a, b){
  library(data.table)
  toString(unique((dt[.(c(a, b)), on=.(V1), V2[duplicated(V2)]])))
}

#initiate parallel processing
plan(multisession) # on Windows machine - use plan(multicore) on Linux

#apply function across rows
common_products <- future_apply(dt, 1, function(y) find_products(y['V1'], y['V2']))

dt_final <- cbind(dt, common_products)

head(dt_final, 17)
#    V1 V2 common_products
# 1:  G  e                
# 2:  N  L                
# 3:  Z  G                
# 4:  u  z                
# 5:  C  d                
# 6:  t  D                
# 7:  w  8                
# 8:  e  T                
# 9:  d  v                
#10:  3  b                
#11:  C  y                
#12:  A  j                
#13:  g  M                
#14:  N  Q                
#15:  l  9                
#16:  U  0                
#17:  i  i      i, z, B, l

Update 3 (because ego crashes on the big graph)

The ego approach in Update 2 seems to crash because the network is too large, so we can try it in a row-wise manner instead

g <- simplify(graph_from_data_frame(dt, directed = FALSE))
nms <- names(V(g))
dt[
    ,
    common := toString(nms[do.call(intersect, ego(g, 1, unlist(.(V1, V2)), mindist = 1))]),
    .(id = seq_along(V1))
]

Update 2 (to address the crashes when using igraph)

Given the crashes that triangles or subgraph_isomorphisms may trigger on a large igraph object, we can use the code below instead (it may sacrifice some speed, but it should be a stable workaround)

g <- simplify(graph_from_data_frame(dt, directed = FALSE))
nms <- names(V(g))
transform(
    dt,
    common = Map(
        function(...) nms[intersect(...)],
        ego(g, 1, V1, mindist = 1), 
        ego(g, 1, V2, mindist = 1)
    )
)

Update

I think you can try subgraph_isomorphisms to find all triangles (with all their permutations) and then merge them into dt

g <- simplify(graph_from_data_frame(dt, directed = FALSE))
dtg <- as.data.table(do.call(rbind, Map(names, subgraph_isomorphisms(make_ring(3), g))))
dtg[
    dt,
    on = .(V1 = V1, V2 = V2)
][
    ,
    .(common = toString(na.omit(V3))), .(V1, V2)
][
    V1 == V2,
    common := sapply(V1, function(x) toString(names(neighbors(g, x))))
][]

which should be fast and gives you

     V1 V2           common
  1:  G  e
  2:  N  L
  3:  Z  G
  4:  u  z
  5:  C  d                B
  6:  t  D
  7:  w  8
  8:  e  T
  9:  d  v
 10:  3  b
 11:  C  y
 12:  A  j
 13:  g  M
 14:  N  Q
 15:  l  9
 16:  U  0
 17:  i  i l, z, 7, B, 5, E
 18:  z  6
 19:  N  R                S
 20:  m  d
 21:  v  z
 22:  D  U
 23:  e  U
 24:  7  A
 25:  G  k
 26:  N  S                R
 27:  0  V
 28:  N  C
 29:  r  E
 30:  L  a
 31:  T  Z
 32:  b  4
 33:  U  2
 34:  B  d             C, 6
 35:  p  v
 36:  f  b
 37:  n  Y
 38:  6  W                j
 39:  i  z
 40:  P  V
 41:  o  g
 42:  e  b
 43:  m  E
 44:  Y  G
 45:  W  j                6
 46:  m  S
 47:  1  A
 48:  T  k
 49:  j  6                W
 50:  g  r
 51:  T  c
 52:  r  Y
 53:  R  K
 54:  F  S
 55:  4  V
 56:  6  B                d
 57:  J  W
 58:  W  4
 59:  f  H
 60:  P  D
 61:  u  H
 62:  I  t
 63:  S  R                N
 64:  K  m
 65:  e  s
 66:  F  P
 67:  T  3
 68:  l  K
 69:  5  i
 70:  s  K                O
 71:  L  d                r
 72:  q  q                P
 73:  L  r                d
 74:  K  O                s
 75:  T  N
 76:  t  t          D, I, x
 77:  r  d                L
 78:  O  j
 79:  m  b
 80:  x  t
 81:  Q  I
 82:  i  B
 83:  O  s                K
 84:  K  V
 85:  k  s
 86:  C  B                d
 87:  i  l
 88:  7  i
 89:  F  w
 90:  8  X
 91:  E  i
 92:  3  O
 93:  d  6                B
 94:  s  v
 95:  m  H
 96:  n  a
 97:  S  6
 98:  P  q
 99:  o  J
100:  b  m
     V1 V2           common

Previous answer

Perhaps this helps

library(igraph)

g <- simplify(graph_from_data_frame(dt, directed = FALSE))
dcast(
    melt(dt[, id := 1:.N], "id")[
        ,
        common := toString(names(V(g))[do.call(intersect, ego(g, nodes = value, mindist = 1))]),
        id
    ],
    id + common ~ variable
)[, .(V1, V2, common)]

which gives

     V1 V2           common
  1:  G  e
  2:  N  L
  3:  Z  G
  4:  u  z
  5:  C  d                B
  6:  t  D
  7:  w  8
  8:  e  T
  9:  d  v
 10:  3  b
 11:  C  y
 12:  A  j
 13:  g  M
 14:  N  Q
 15:  l  9
 16:  U  0
 17:  i  i l, z, 7, B, 5, E
 18:  z  6
 19:  N  R                S
 20:  m  d
 21:  v  z
 22:  D  U
 23:  e  U
 24:  7  A
 25:  G  k
 26:  N  S                R
 27:  0  V
 28:  N  C
 29:  r  E
 30:  L  a
 31:  T  Z
 32:  b  4
 33:  U  2
 34:  B  d             C, 6
 35:  p  v
 36:  f  b
 37:  n  Y
 38:  6  W                j
 39:  i  z
 40:  P  V
 41:  o  g
 42:  e  b
 43:  m  E
 44:  Y  G
 45:  W  j                6
 46:  m  S
 47:  1  A
 48:  T  k
 49:  j  6                W
 50:  g  r
 51:  T  c
 52:  r  Y
 53:  R  K
 54:  F  S
 55:  4  V
 56:  6  B                d
 57:  J  W
 58:  W  4
 59:  f  H
 60:  P  D
 61:  u  H
 62:  I  t
 63:  S  R                N
 64:  K  m
 65:  e  s
 66:  F  P
 67:  T  3
 68:  l  K
 69:  5  i
 70:  s  K                O
 71:  L  d                r
 72:  q  q                P
 73:  L  r                d
 74:  K  O                s
 75:  T  N
 76:  t  t          D, I, x
 77:  r  d                L
 78:  O  j
 79:  m  b
 80:  x  t
 81:  Q  I
 82:  i  B
 83:  O  s                K
 84:  K  V
 85:  k  s
 86:  C  B                d
 87:  i  l
 88:  7  i
 89:  F  w
 90:  8  X
 91:  E  i
 92:  3  O
 93:  d  6                B
 94:  s  v
 95:  m  H
 96:  n  a
 97:  S  6
 98:  P  q
 99:  o  J
100:  b  m
     V1 V2           common

Three-way pairs can be treated as triangles in an undirected graph, and the igraph package can find these efficiently. I've included an example with 20M pairs of 3-character product codes; it runs in about 70 seconds on a single thread.

library(data.table)
library(stringi)
library(igraph)

getpaired <- function(g, id) names(unique(neighbors(g, id)))
commonProducts <- function(dt) {
  blnSort <- dt$V1 > dt$V2
  dt[blnSort, c("V2", "V1") := list(V1, V2)] # sort each row
  # get triangles
  g <- graph_from_data_frame(dt, FALSE)
  m <- matrix(V(g)$name[triangles(g)], ncol = 3, byrow = TRUE)
  # sort each row
  m <- matrix(m[order(row(m), m, method = "radix")], ncol = 3, byrow = TRUE)
  dt3 <- as.data.table(m)
  # map common products back to the original dataframe
  dt3 <- rbindlist(
    list(
      # the three ordered pairs in each triangle
      dt3,
      dt3[, c(1, 3, 2)],
      dt3[, c(2, 3, 1)],
      # common products in "two-sided" triangles
      dt[V1 == V2][
        , .(V2 = V2, V3 = .(getpaired(g, V1))), by = "V1"
      ][
        , .(V1 = rep(rep.int(V1, lengths(V3)), 2),
            V2 = c(rep.int(V1, lengths(V3)), unlist(V3)),
            V3 = c(unlist(V3), rep.int(V1, lengths(V3))))
      ][ # sort (V1, V2) in each row
        V1 > V2, c("V2", "V1") := list(V1, V2)
      ]
    ),
    FALSE # bind by index
  )[ # collapse common products into a single vector for each pair
    , .(V3 = .(V3)),
    by = c("V1", "V2")
  ][ # join into the original (row-sorted) data.table
    dt, on = c("V1", "V2")
  ][ # unsort V1, V2 in each row to match the original (unsorted) data.table
    , c("V1", "V2") := dt[blnSort, c("V2", "V1") := list(V1, V2)]
  ]
}

set.seed(1)

# build mock data
dt <- data.table(V1 = stri_rand_strings(100, 1),
                 V2 = stri_rand_strings(100, 1))

dt3 <- commonProducts(dt)
print(dt3)
#>      V1 V2              V3
#>   1:  G  e                
#>   2:  N  L                
#>   3:  Z  G                
#>   4:  u  z                
#>   5:  C  d               B
#>   6:  t  D               t
#>   7:  w  8                
#>   8:  e  T                
#>   9:  d  v                
#>  10:  3  b                
#>  11:  C  y                
#>  12:  A  j                
#>  13:  g  M                
#>  14:  N  Q                
#>  15:  l  9                
#>  16:  U  0                
#>  17:  i  i i,7,E,B,5,z,...
#>  18:  z  6                
#>  19:  N  R               S
#>  20:  m  d                
#>  21:  v  z                
#>  22:  D  U                
#>  23:  e  U                
#>  24:  7  A                
#>  25:  G  k                
#>  26:  N  S               R
#>  27:  0  V                
#>  28:  N  C                
#>  29:  r  E                
#>  30:  L  a                
#>  31:  T  Z                
#>  32:  b  4                
#>  33:  U  2                
#>  34:  B  d             C,6
#>  35:  p  v                
#>  36:  f  b                
#>  37:  n  Y                
#>  38:  6  W               j
#>  39:  i  z               i
#>  40:  P  V                
#>  41:  o  g                
#>  42:  e  b                
#>  43:  m  E                
#>  44:  Y  G                
#>  45:  W  j                
#>  46:  m  S                
#>  47:  1  A                
#>  48:  T  k                
#>  49:  j  6               W
#>  50:  g  r                
#>  51:  T  c                
#>  52:  r  Y                
#>  53:  R  K                
#>  54:  F  S                
#>  55:  4  V                
#>  56:  6  B               d
#>  57:  J  W                
#>  58:  W  4                
#>  59:  f  H                
#>  60:  P  D                
#>  61:  u  H                
#>  62:  I  t               t
#>  63:  S  R               N
#>  64:  K  m                
#>  65:  e  s                
#>  66:  F  P                
#>  67:  T  3                
#>  68:  l  K                
#>  69:  5  i               i
#>  70:  s  K               O
#>  71:  L  d                
#>  72:  q  q           P,q,q
#>  73:  L  r               d
#>  74:  K  O               s
#>  75:  T  N                
#>  76:  t  t       D,I,t,x,t
#>  77:  r  d               L
#>  78:  O  j                
#>  79:  m  b                
#>  80:  x  t               t
#>  81:  Q  I                
#>  82:  i  B               i
#>  83:  O  s               K
#>  84:  K  V                
#>  85:  k  s                
#>  86:  C  B               d
#>  87:  i  l               i
#>  88:  7  i               i
#>  89:  F  w                
#>  90:  8  X                
#>  91:  E  i               i
#>  92:  3  O                
#>  93:  d  6               B
#>  94:  s  v                
#>  95:  m  H                
#>  96:  n  a                
#>  97:  S  6                
#>  98:  P  q               q
#>  99:  o  J                
#> 100:  b  m                
#>      V1 V2              V3

# timing a much larger dataset
dt <- data.table(V1 = stri_rand_strings(2e7, 3),
                 V2 = stri_rand_strings(2e7, 3))

system.time(dt3 <- commonProducts(dt))
#>    user  system elapsed 
#>   72.75    3.05   71.88
dt3[lengths(V3) != 0L] # show only those pairs with common products
#>           V1  V2  V3
#>       1: GBW mDN lxF
#>       2: ix6 jpR 0VI
#>       3: xLG VeE aik
#>       4: A36 RzJ YYu
#>       5: zAo OYu zAo
#>      ---            
#> 1841567: qX9 xrW 7lb
#> 1841568: knO 2G6 knO
#> 1841569: rsU 5Rw ER8
#> 1841570: Bts 3L1 1bQ
#> 1841571: c0h pgd jxJ

This handles the "two-sided triangles" created when V1 == V2 (as in row 17 of the OP's example data). For example, if the entire dataset consisted of the pairs (t, i) and (i, i), then i would be the common product for (t, i) (i is paired with both t and i), and i, t would be the common products for (i, i) (i and t are each paired with i).
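
As a quick, hedged illustration of that case (it assumes the commonProducts() function defined above is in scope; the exact print format may differ):

# tiny example: one (t, i) pair plus one (i, i) pair
dt_small <- data.table(V1 = c("t", "i"), V2 = c("i", "i"))
commonProducts(dt_small)
# expectation: (t, i) lists "i" as a common product,
#              (i, i) lists both "i" and "t"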

I believe you can stick with data.table to accomplish what you want. As commenters have already pointed out, you need to decide whether the product pairing [a, b] is equivalent to [b, a] (in your example answer, it only counts the pairing [a, b]). Either way, the bottleneck of this answer is the Map() call; you could improve speed with future_Map(), but you'd have to test on your actual data to see whether that is necessary.

I also want to flag that I keep the common-products column as a list-column, although you may want it in a different format. Right now it is a mix of NULL and empty character entries when there is no match, so if you keep it as a list-column you may want to clean that up -- that's up to you.

The solution:

dt_unique = unique(dt[, .(V1, V2)])
dt_pairs = dt_unique[, list(ref_list = list(unique(V2))), .(product = V1)]

dt_unique = dt_pairs[dt_unique, on = c("product" = "V2")]
setnames(dt_unique, c("V2", "V2_ref", "V1"))
dt_unique = dt_pairs[dt_unique, on = c("product" = "V1")]
setnames(dt_unique, c("V1", "V1_ref", "V2", "V2_ref"))

dt_unique[, common_prods := Map(function(x, y) unique.default(y[chmatch(x, y, 0L)]), V1_ref, V2_ref)]

dt_unique[, c("V1_ref", "V2_ref") := NULL]
dt_unique[dt, on = c("V1", "V2")]

     V1 V2 common_prods correct_common_prods
  1:  G  e                                  
  2:  N  L                                  
  3:  Z  G                                  
  4:  u  z                                  
  5:  C  d                                  
  6:  t  D                                  
  7:  w  8                                  
  8:  e  T                                  
  9:  d  v                                  
 10:  3  b                                  
 11:  C  y                                  
 12:  A  j                                  
 13:  g  M                                  
 14:  N  Q                                  
 15:  l  9                                  
 16:  U  0                                  
 17:  i  i      i,z,B,l           i, z, B, l
 18:  z  6                                  
 19:  N  R                                  
 20:  m  d                                  
 21:  v  z                                  
 22:  D  U                                  
 23:  e  U                                  
 24:  7  A                                  
 25:  G  k                                  
 26:  N  S            R                    R
 27:  0  V                                  
 28:  N  C                                  
 29:  r  E                                  
 30:  L  a                                  
 31:  T  Z                                  
 32:  b  4                                  
 33:  U  2                                  
 34:  B  d                                  
 35:  p  v                                  
 36:  f  b                                  
 37:  n  Y                                  
 38:  6  W                                  
 39:  i  z                                  
 40:  P  V                                  
 41:  o  g                                  
 42:  e  b                                  
 43:  m  E                                  
 44:  Y  G                                  
 45:  W  j                                  
 46:  m  S                                  
 47:  1  A                                  
 48:  T  k                                  
 49:  j  6                                  
 50:  g  r                                  
 51:  T  c                                  
 52:  r  Y                                  
 53:  R  K                                  
 54:  F  S                                  
 55:  4  V                                  
 56:  6  B                                  
 57:  J  W                                  
 58:  W  4                                  
 59:  f  H                                  
 60:  P  D                                  
 61:  u  H                                  
 62:  I  t            t                    t
 63:  S  R                                  
 64:  K  m                                  
 65:  e  s                                  
 66:  F  P                                  
 67:  T  3                                  
 68:  l  K                                  
 69:  5  i            i                    i
 70:  s  K                                  
 71:  L  d                                  
 72:  q  q            q                    q
 73:  L  r            d                    d
 74:  K  O                                  
 75:  T  N                                  
 76:  t  t          D,t                 D, t
 77:  r  d                                  
 78:  O  j                                  
 79:  m  b                                  
 80:  x  t            t                    t
 81:  Q  I                                  
 82:  i  B                                  
 83:  O  s                                  
 84:  K  V                                  
 85:  k  s                                  
 86:  C  B            d                    d
 87:  i  l                                  
 88:  7  i            i                    i
 89:  F  w                                  
 90:  8  X                                  
 91:  E  i            i                    i
 92:  3  O                                  
 93:  d  6                                  
 94:  s  v                                  
 95:  m  H                                  
 96:  n  a                                  
 97:  S  6                                  
 98:  P  q            q                    q
 99:  o  J                                  
100:  b  m                                  
     V1 V2 common_prods correct_common_prods

Reproducible code (with comments):

library(data.table)

n = 1e2
set.seed(1)

dt <- data.table(V1 = stringi::stri_rand_strings(n, 1),
                 V2 = stringi::stri_rand_strings(n, 1))

#Matching your output:
find_products <- function(a, b){
  library(data.table)
  toString(unique((dt[.(c(a, b)), on=.(V1), V2[duplicated(V2)]])))
}
dt[, correct_common_prods := apply(dt, 1, function(y) find_products(y[['V1']], y[['V2']]))]


# If (a, b) and (b, a) are equivalent, you'll want this instead:
# dt_unique = unique(rbindlist(list(dt[, .(V1, V2)], dt[, .(V2, V1)]), use.names = FALSE))
dt_unique = unique(dt[, .(V1, V2)])

# Creating list-column w/ corresponding products
dt_pairs = dt_unique[, list(ref_list = list(unique(V2))), .(product = V1)]

# Merging and re-naming. There may be a more data.table way to 
# handle the renaming because this feels not-eloquent
dt_unique = dt_pairs[dt_unique, on = c("product" = "V2")]
setnames(dt_unique, c("V2", "V2_ref", "V1"))
dt_unique = dt_pairs[dt_unique, on = c("product" = "V1")]
setnames(dt_unique, c("V1", "V1_ref", "V2", "V2_ref"))

# This is the memory-intensive part because it checks for the intersection on
# each row. This creates a list-column `common_prods`
# OR, easier to read but slower: 
# dt_unique[, common_prods := Map(intersect, V1_ref, V2_ref)]
dt_unique[, common_prods := Map(function(x, y) unique.default(y[chmatch(x, y, 0L)]), V1_ref, V2_ref)]

# Column cleanup (retain _ref columns to better understand how this works)
# then merging in the common products
dt_unique[, c("V1_ref", "V2_ref") := NULL]
dt = dt_unique[dt, on = c("V1", "V2")]
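
If the Map() step does turn out to be the bottleneck on your real data, a minimal sketch of the future_Map() swap mentioned above follows; run it in place of the Map() line (i.e. before the *_ref columns are dropped). It is not benchmarked, so treat the benefit of parallelisation as an assumption to verify.

library(future.apply)
plan(multisession)  # as in the question; plan(multicore) on Linux

# parallel version of the row-wise intersection step above
dt_unique[, common_prods := future_Map(
  function(x, y) unique.default(y[chmatch(x, y, 0L)]),
  V1_ref, V2_ref
)]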

This problem looks like market basket analysis, so I would suggest approaching it that way.

The catch with the random sample data is that, precisely because it is random, you will most likely not find any strong associations between products ;-). But hopefully that will not be the case for your production dataset(s).

There is a great tutorial on market basket analysis on DataCamp that I (mostly) followed for this answer. For more in-depth information, make sure to work through that tutorial yourself; the link is in the comment at the top of the code below.

# Workflow adapted from 
#   https://www.datacamp.com/community/tutorials/market-basket-analysis-r
library(arules)
library(arulesViz)
library(tidyverse)

# Put items bought together into a single column, separated with commas,
#  then convert into transactions
write.csv(setDF(dt) %>% unite("items", everything(), sep = ","),
          "./temp/my_transactions.csv", 
          quote = FALSE, row.names = FALSE)
# Now read the csv into a transactions file
tr <- read.transactions("./temp/my_transactions.csv", 
                        format = "basket", sep = ",")

# Frequency plot (for getting insight only, so commented out)
#   itemFrequencyPlot(tr, topN = 15, type = "absolute", main="Absolute Item Frequency Plot")
#   itemFrequencyPlot(tr, topN = 15, type = "relative", main="Absolute Item Frequency Plot")

# Mine rules
# since your sample data is rather small and random...
# there will not be many items bought together frequently...
# so I set the confidence to 0.5 (i.e. 50%).
# If your data set grows, you should/can increase the confidence 
# to get more reliable pairing
association.rules <- apriori(tr, parameter = list(supp = 0.001, conf = 0.5, maxlen = 100))

# what have we found here?
inspect(association.rules)
#      lhs    rhs support    confidence coverage   lift      count
# [1]  {x} => {t} 0.00990099 1.0        0.00990099 25.250000 1    
# [2]  {5} => {i} 0.00990099 1.0        0.00990099 14.428571 1    
# [3]  {c} => {T} 0.00990099 1.0        0.00990099 16.833333 1    
# [4]  {1} => {A} 0.00990099 1.0        0.00990099 33.666667 1    
# [5]  {p} => {v} 0.00990099 1.0        0.00990099 25.250000 1    
# [6]  {2} => {U} 0.00990099 1.0        0.00990099 25.250000 1    
# [7]  {9} => {l} 0.00990099 1.0        0.00990099 33.666667 1    
# [8]  {M} => {g} 0.00990099 1.0        0.00990099 33.666667 1    
# [9]  {y} => {C} 0.00990099 1.0        0.00990099 25.250000 1    
# [10] {X} => {8} 0.00990099 1.0        0.00990099 50.500000 1    
# [11] {8} => {X} 0.00990099 0.5        0.01980198 50.500000 1    
# [12] {q} => {P} 0.00990099 0.5        0.01980198 12.625000 1    
# [13] {a} => {n} 0.00990099 0.5        0.01980198 25.250000 1    
# [14] {n} => {a} 0.00990099 0.5        0.01980198 25.250000 1    
# [15] {a} => {L} 0.00990099 0.5        0.01980198 12.625000 1    
# [16] {0} => {U} 0.00990099 0.5        0.01980198 12.625000 1    
# [17] {0} => {V} 0.00990099 0.5        0.01980198 12.625000 1    
# [18] {u} => {H} 0.00990099 0.5        0.01980198 16.833333 1    
# [19] {u} => {z} 0.00990099 0.5        0.01980198 12.625000 1    
# [20] {Q} => {I} 0.00990099 0.5        0.01980198 25.250000 1    
# [21] {I} => {Q} 0.00990099 0.5        0.01980198 25.250000 1    
# [22] {Q} => {N} 0.00990099 0.5        0.01980198  8.416667 1    
# [23] {Z} => {G} 0.00990099 0.5        0.01980198 12.625000 1    
# [24] {Z} => {T} 0.00990099 0.5        0.01980198  8.416667 1    
# [25] {f} => {H} 0.00990099 0.5        0.01980198 16.833333 1    
# [26] {f} => {b} 0.00990099 0.5        0.01980198  8.416667 1    
# [27] {n} => {Y} 0.00990099 0.5        0.01980198 16.833333 1    
# [28] {o} => {J} 0.00990099 0.5        0.01980198 25.250000 1    
# [29] {J} => {o} 0.00990099 0.5        0.01980198 25.250000 1    
# [30] {o} => {g} 0.00990099 0.5        0.01980198 16.833333 1    
# [31] {J} => {W} 0.00990099 0.5        0.01980198 12.625000 1    
# [32] {I} => {t} 0.00990099 0.5        0.01980198 12.625000 1    
# [33] {w} => {8} 0.00990099 0.5        0.01980198 25.250000 1    
# [34] {8} => {w} 0.00990099 0.5        0.01980198 25.250000 1    
# [35] {w} => {F} 0.00990099 0.5        0.01980198 16.833333 1    
# [36] {7} => {A} 0.00990099 0.5        0.01980198 16.833333 1    
# [37] {7} => {i} 0.00990099 0.5        0.01980198  7.214286 1 

# what does the above mean?
#    [1] 100% of people that have bought x, also bought t
#   [11]  50% of people that have bought 8, also bought X

# visualise
plot(association.rules, method = "graph",  engine = "htmlwidget")

plot(association.rules, method="paracoord")

Apologies if I've misunderstood the question (it does reproduce the example, fwiw), but this seems like it might be more efficient as a simple join. For n = 10M it runs in about 3 seconds.

My data.table is rusty, so I'm using dtplyr to translate my dplyr syntax into data.table syntax. I imagine there are data.table optimizations that could make this faster, or you could use the collapse package.

Setup

library(data.table)
library(dplyr)   
library(dtplyr)    
library(stringi)
n = 10000000
set.seed(1)
dt <- data.table(V1 = stri_rand_strings(n, 2),  # 3,844 groups
                 V2 = stri_rand_strings(n, 2))

Join V2 onto each associated V1

dt_single <- distinct(dt, V1)
dt_single %>%
  left_join(dt, by = c("V1" = "V1")) %>%
  group_by(V1) %>%
  summarize(common_products = paste(V2, sep = ", ", collapse = ", "))

To get output in the same format as the OP, we can assign the result above to common (e.g. with -> common) and then do:

dt %>%
  left_join(common) %>%
  mutate(common_products = if_else(V1 == V2, common_products, "")) %>% 
  select(V1, V2, common_products) 

which, as far as I can tell, gets the OP's desired output:

   V1    V2    common_products
   <chr> <chr> <chr>          
 1 G     e                   
 2 N     L                   
 3 Z     G                   
 4 u     z                   
 5 C     d                   
 6 t     D                   
 7 w     8                   
 8 e     T                   
 9 d     v                   
10 3     b                   
11 C     y                   
12 A     j                   
13 g     M                   
14 N     Q                   
15 l     9                   
16 U     0                   
17 i     i     i, z, B, l  
...
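
For reference, here is a plain data.table sketch of the same two steps (my own hand translation, not the code dtplyr generates, and untested at n = 10M):

# aggregate every V2 bought with each V1, then join back onto dt and blank out
# rows where V1 != V2, mirroring the dplyr pipeline above
common <- dt[, .(common_products = paste(V2, collapse = ", ")), by = V1]

common[dt, on = "V1"][
    , common_products := fifelse(V1 == V2, common_products, "")
][
    , .(V1, V2, common_products)
]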