R 中的快速部分匹配检查(或 Python 或 Julia)

fast partial match checking in R (or Python or Julia)

我有两个带名称的数据集,我需要比较两个数据集中的名称。我只需要根据名称保留两个数据集的并集。但是,如果一个名字是另一个名字的一部分,即使它不是完全匹配,它仍然被认为是 'matched',反之亦然。例如,“seb”应匹配“seb”,但也应匹配“sebas”。我正在使用 str_detect(),但它太慢了。我想知道是否有任何方法可以加快这个过程。我尝试了一些其他的包和函数,但并没有真正提高速度。我对任何 R 或 Python 解决方案持开放态度。

创建两个虚拟数据集

library(dplyr)
library(stringr)

set.seed(1)

data_set_A <- tibble(name =  unique(replicate(2000, paste(sample(letters, runif(1, 3, 10), replace = T), collapse = "")))) %>% 
  mutate(ID_A = 1:n())
                    
set.seed(2)

data_set_B <- tibble(name_2 =  unique(replicate(2000, paste(sample(letters, runif(1, 3, 10), replace = T), collapse = "")))) %>% 
  mutate(ID_B = 1:n())

仅测试完全匹配的匹配

# This is almost instant
data_set_A %>%
  rowwise() %>%
  filter(any(name %in% data_set_B$name_2) | any(data_set_B$name_2 %in% name)) %>%
  ungroup()
# A tibble: 4 x 2
  name   ID_A
  <chr> <int>
1 vnt     112
2 fly     391
3 cug    1125
4 xgv    1280

包括部分匹配(这是我要优化的)

这当然只给了我数据集A的子集,但没关系。

# This takes way too long
data_set_A %>%
  rowwise() %>%
  filter(any(str_detect(name, data_set_B$name_2)) | any(str_detect(data_set_B$name_2, name))) %>%
  ungroup()
A tibble: 237 x 2
   name       ID_A
   <chr>     <int>
 1 wknrsauuj     2
 2 lyw           7
 3 igwsvrzpk    16
 4 zozxjpu      18
 5 cgn          22
 6 oqo          45
 7 gkritbe      47
 8 uuq          92
 9 lhwfyksz     94
10 tuw         100

模糊连接方法。

这也有效,但速度同样慢

bind_rows(
  fuzzyjoin::fuzzy_inner_join(
    data_set_A,
    data_set_B,
    by = c("name" = "name_2"),
    match_fun = stringr::str_detect
  ) %>%
    select(name, ID_A),
  fuzzyjoin::fuzzy_inner_join(
    data_set_B,
    data_set_A,
    by = c("name_2" = "name"),
    match_fun = stringr::str_detect
  ) %>%
    select(name, ID_A)
) %>%
  distinct()

data.table解决方案

不幸的是并没有快多少

library(data.table)

setDT(data_set_A)
setDT(data_set_B)

data_set_A[data_set_A[, .I[any(str_detect(name, data_set_B$name_2)) | 
                    any(str_detect(data_set_B$name_2, name))], by = .(ID_A)]$V1]

TL;DR

慢的部分是str_detect(string, pattern)

为了加快它的速度,如果你有简单的字符串,将 pattern 包装在 fixed() 中,如果你有更长的典型人类文本,则将其包装在 coll() 中。

要获得另一个轻微的速度提升,请使用 purrr::map_lgl() 重写您的代码并使用它来对您的数据进行子集化。

下面是示例、解释和基准。

使用 fixed()coll()

重写 str_detect()

我认为最简单的解决方法是修改 str_detect() 使用正则表达式的方式,例如stringr::fixed()stringr::coll().

来自 ?stringr::str_detect():

Match a fixed string (i.e. by comparing only bytes), using fixed(). This is fast, but approximate. Generally, for matching human text, you'll want coll() which respects character matching rules for the specified locale.

下面是与你原代码的对比:

original <- function() {
  
  data_set_A %>%
    rowwise() %>%
    filter(any(str_detect(name, data_set_B$name_2)) | any(str_detect(data_set_B$name_2, name))) %>%
    ungroup()
  
}

# Note the use of fixed()
using_fixed <- function() {
  
  data_set_A %>%
    rowwise() %>%
    filter(any(str_detect(name, fixed(data_set_B$name_2))) | any(str_detect(data_set_B$name_2, fixed(name)))) %>%
    ungroup()
  
}

# Note the use of coll()
using_coll <- function() {
  
  data_set_A %>%
    rowwise() %>%
    filter(any(str_detect(name, coll(data_set_B$name_2))) | any(str_detect(data_set_B$name_2, coll(name)))) %>%
    ungroup()
  
}


bm <- bench::mark(
  original(),
  using_fixed(),
  using_coll(),
  iterations = 20
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.

bm
#> # A tibble: 3 × 6
#>   expression         min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>    <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 original()       6.58s    6.59s     0.152    32.4MB    0.371
#> 2 using_fixed() 501.64ms 505.51ms     1.97     61.4MB    3.94 
#> 3 using_coll()     4.48s     4.5s     0.222    61.4MB    0.512

bm %>% ggplot2::autoplot(type = "violin")
#> Loading required namespace: tidyr

reprex package (v2.0.1)

创建于 2022-04-02

因此,正如我们所见,将您的代码包装在 fixed() 中会使它非常快并且可以很好地处理您的测试数据。但是,它可能不适用于真实的人类文本(尤其是 non-ASCII 字符集)。您应该在原始数据上对其进行测试,如果 fixed() 不起作用,则使用 coll() 作为替代方案。

删除rowwise()

您可以采取的另一个使代码更快的步骤是摆脱 rowwise()。我将使用 purrr::map_lgl() 替换它并使用此逻辑向量对数据帧进行子集化。以下是针对我上面定义的函数的示例和基准:

using_map_fixed <- function() {
  
  logical_vec <- data_set_A$name %>% 
    purrr::map_lgl(
      ~any(stringr::str_detect(.x, fixed(data_set_B$name_2))) || 
        any(stringr::str_detect(data_set_B$name_2, fixed(.x)))
    )
  
  
  data_set_A[logical_vec, ]
  
}


using_map_coll <- function() {
  
  logical_vec <- data_set_A$name %>% 
    purrr::map_lgl(
      ~any(stringr::str_detect(.x, coll(data_set_B$name_2))) || 
        any(stringr::str_detect(data_set_B$name_2, coll(.x)))
    )
  
  
  data_set_A[logical_vec, ]
  
}


bm <- bench::mark(
  using_fixed(),
  using_map_fixed(),
  using_coll(),
  using_map_coll(),
  iterations = 20
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.

bm
#> # A tibble: 4 × 6
#>   expression             min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>        <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 using_fixed()      503.4ms 507.24ms     1.95     62.9MB    5.37 
#> 2 using_map_fixed() 474.28ms 477.63ms     2.09     60.3MB    3.14 
#> 3 using_coll()         4.49s     4.5s     0.222    61.4MB    0.489
#> 4 using_map_coll()     4.37s    4.38s     0.228    60.2MB    0.354

reprex package (v2.0.1)

创建于 2022-04-02

如我们所见,这又带来了轻微的速度提升。

fixed()data.tablefuzzyjoin

结合使用

您还可以将 fixed()data.tablefuzzyjoin 一起使用。为简洁起见,我没有在此处包含它,但我的基准测试显示 data.table 花费的时间与上面的 using_map_fixed() 大致相同,而 fuzzyjoin 花费的时间大约是其两倍。

这对我来说很有意义,因为慢的部分是 str_detect(),而不是 joining/filtering 的方法或底层数据结构。

这是一个 [r] 选项,旨在减少您调用 str_detect() 的次数(即,您的示例很慢,因为该函数被调用了数千次;并且不使用 fixed()fixed = TRUE 正如 jpiversen 已经指出的那样)。答案在代码的注释中解释;明天我会试着继续解释更多。

与当前方法相比,这应该可以很好地扩展并且内存效率更高,因为将按行计算减少到绝对最小值。

基准:

n = 2000

# A tibble: 4 × 13
  expression             min   median `itr/sec` mem_alloc `gc/sec` n_itr
  <bch:expr>        <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int>
1 original()           6.67s    6.67s     0.150   31.95MB    0.300     1
2 using_fixed()     496.54ms 496.54ms     2.01    61.39MB    4.03      1
3 using_map_fixed() 493.35ms 493.35ms     2.03    60.27MB    6.08      1
4 andrew_fun()      167.78ms 167.78ms     5.96     1.59MB    0         1

n = 4000

注意:我不确定你是否需要缩放的答案;但是减少 memory-intensive 部分的方法似乎确实做到了这一点(尽管对于 n = 4000 的 1 次迭代,IMO 的时间差可以忽略不计)。

# A tibble: 4 × 13
  expression             min   median `itr/sec` mem_alloc `gc/sec` n_itr
  <bch:expr>        <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int>
1 original()          26.63s   26.63s    0.0376  122.33MB    0.150     1
2 using_fixed()        1.91s    1.91s    0.525   243.96MB    3.67      1
3 using_map_fixed()    1.87s    1.87s    0.534   236.62MB    3.20      1
4 andrew_fun()      674.36ms 674.36ms    1.48      7.59MB    0         1

带注释的代码:

# This is so we do not retain the strings with the max number of
# characters in our pattern because we are checking with %in% already
nchar_a = nchar(data_set_A$name)
nchar_b = nchar(data_set_B$name_2)

# Creating large patterns (excluding values w/ max number of characters)
pattern_a = str_c(unique(data_set_A$name[nchar_a != max(nchar_a, na.rm = TRUE)]), collapse = "|")
pattern_b = str_c(unique(data_set_B$name_2[nchar_b != max(nchar_b, na.rm = TRUE)]), collapse = "|")

# First checking using %in% 
idx_a = data_set_A$name %in% data_set_B$name_2

# Next, IDing when a(string) matches b(pattern) 
idx_a[!idx_a] = str_detect(data_set_A$name[!idx_a], pattern_b)

# IDing a(pattern) matches b(string) so we do not run every row of 
# a(as a pattern) against all of b
b_to_check = data_set_B$name_2[str_detect(data_set_B$name_2, pattern_a)]

# Using unmatched values of a as a pattern for the reduced set for b
idx_a[!idx_a] = vapply(data_set_A$name[!idx_a], function(name) {
  any(grepl(name, b_to_check, fixed = TRUE))
}, logical(1L), USE.NAMES = FALSE)

data_set_A[idx_a, ]
# A tibble: 237 × 2
   name       ID_A
   <chr>     <int>
 1 wknrsauuj     2
 2 lyw           7
 3 igwsvrzpk    16
 4 zozxjpu      18
 5 cgn          22
 6 oqo          45
 7 gkritbe      47
 8 uuq          92
 9 lhwfyksz     94
10 tuw         100
# … with 227 more rows

用于基准测试的可重现 R 代码

以下代码主要取自 jpiversen,他提供了很好的答案:

library(dplyr)
library(stringr)

n = 2000

set.seed(1)
data_set_A <- tibble(name =  unique(replicate(n, paste(sample(letters, runif(1, 3, 10), replace = T), collapse = "")))) %>% 
  mutate(ID_A = 1:n())

set.seed(2)
data_set_B <- tibble(name_2 =  unique(replicate(n, paste(sample(letters, runif(1, 3, 10), replace = T), collapse = "")))) %>% 
  mutate(ID_B = 1:n())


original <- function() {
  
  data_set_A %>%
    rowwise() %>%
    filter(any(str_detect(name, data_set_B$name_2)) | any(str_detect(data_set_B$name_2, name))) %>%
    ungroup()
  
}

using_fixed <- function() {
  
  data_set_A %>%
    rowwise() %>%
    filter(any(str_detect(name, fixed(data_set_B$name_2))) | any(str_detect(data_set_B$name_2, fixed(name)))) %>%
    ungroup()
  
}

using_map_fixed <- function() {
  
  logical_vec <- data_set_A$name %>% 
    purrr::map_lgl(
      ~any(stringr::str_detect(.x, fixed(data_set_B$name_2))) || 
        any(stringr::str_detect(data_set_B$name_2, fixed(.x)))
    )
  
  
  data_set_A[logical_vec, ]
  
}

andrew_fun = function() {
  
  nchar_a = nchar(data_set_A$name)
  nchar_b = nchar(data_set_B$name_2)
  
  pattern_a = str_c(unique(data_set_A$name[nchar_a != max(nchar_a, na.rm = TRUE)]), collapse = "|")
  pattern_b = str_c(unique(data_set_B$name_2[nchar_b != max(nchar_b, na.rm = TRUE)]), collapse = "|")
  
  idx_a = data_set_A$name %in% data_set_B$name_2
  
  idx_a[!idx_a] = str_detect(data_set_A$name[!idx_a], pattern_b)
  
  b_to_check = data_set_B$name_2[str_detect(data_set_B$name_2, pattern_a)]
  
  idx_a[!idx_a] = vapply(data_set_A$name[!idx_a], function(name) {
    any(grepl(name, b_to_check, fixed = TRUE))
  }, logical(1L), USE.NAMES = FALSE)
  
  data_set_A[idx_a, ]
  
}


bm = bench::mark(
  original(),
  using_fixed(),
  using_map_fixed(),
  andrew_fun(),
  iterations = 1
)

如果您想使用 base R,下面的代码可能是一个快速的选择

A <- data_set_A$name
B <- data_set_B$name_2
A2B <- sapply(A, function(x) grepl(x, B, fixed = TRUE))
B2A <- sapply(B, function(x) grepl(x, A, fixed = TRUE))
idx <- which(t(A2B) | B2A, arr.ind = TRUE)
res <- cbind(data_set_A[idx[, 1], ], data_set_B[idx[, 2], ])

这给出了

> res
         name ID_A    name_2 ID_B
1         arh 1234  pimoarhd    8
2         qtj  720      aqtj   23
3    szcympsn  142       cym   43
4   cymvubnxg  245       cym   43
5   dppvtcymq  355       cym   43
6         kzi  690      kzii   48
7   eyajqchkn  498       chk   53
8       upfzh  522       upf   61
9         ioa 1852     ioadr   63
10        lya 1349 ibelyalvh   64
11      honod  504       ono   71
12    zozxjpu   18       zoz   72
13        jcz  914  cdjczpqg   88
14     ailmjf  623       ilm   99
15      upoux  609       oux  104
16   pouxifvp 1466       oux  104
17       mvob  516       vob  106
18  nqtotvhhm 1088       otv  115
19        wom  202 womtglapx  117
20        qkc  756 dqkcfqpps  118
21        qtl  600 ivqtlymzr  126
22        qqi 1605 owfsqqiyu  153
23  fmjalirze 1470       ali  172
24   ibwfwkyp 1588       fwk  175
25        iat 1258    iatjeg  185
26        osm  253  nviiqosm  199
27        wpj  373     wpjeb  204
28       hahx  515       ahx  213
29     keahxa 1565       ahx  213
30        psf  359    qnpsfo  223
31        saq 1859     saqhu  227
32    cvmkwtx  714       cvm  228
33        ilw  389    pyilwj  231
34     ohwysv 1590       ysv  237
35       utrl  698       trl  244
36  dmttrlcpj 1267       trl  244
37        cpv  236  btcpvmoc  247
38        uto 1047      utoi  257
39   yngunekl 1978       ekl  258
40      vceko  625       vce  265
41        fir 1934     firgk  278
42        qvd  983    eqvdfi  287
43        fir 1934   zwwefir  291
44  idvfkevdf 1380       vdf  312
45       qwdo 1921       qwd  322
46        kam 1205     tlkam  327
47        lck  488 clckjkyzn  329
48   gmspwckw 1015       msp  359
49  ynouuwqtz 1576       nou  360
50        tty 1209    bttyvt  361
51        vkc  999   fmrvkcl  366
52        ipw 1918 fipwjomdu  388
53        zdv  261    zdvkut  410
54        vku 1137    zdvkut  410
55       doby  246       oby  411
56  hycvuupgy  141       uup  421
57       uwlb 1249       wlb  431
58        auj 1452   lcmnauj  444
59        rwd 1667 ukwrwdczs  479
60  ylsihqqor 1290       ihq  483
61        feo 1649  feorvxbm  485
62        zff  755 dohzffujm  499
63  mqutujepu  904       epu  507
64      uiepu 1308       epu  507
65    vahepuk 1434       epu  507
66        cug 1125    accugl  509
67        fir 1934     firwe  517
68        dia 1599  dialeddd  527
69     temiwd 1725       tem  531
70    svofivl 1177       svo  545
71        flm  657      aflm  546
72        vnt  112       vnt  551
73   bhmoskrz  426       osk  558
74        wev  728  shemuwev  569
75       hzpi 1586       hzp  579
76        gvi 1064 mkgvivlfe  582
77        fjb 1398 vkfjbxnjl  589
78        qin 1013      qinp  593
79        ecn 1342    ecnzre  598
80        zre 1610    ecnzre  598
81        xvr  772  dpxvrfmo  623
82        tqr 1419  tqrmztdm  624
83      zmwnf 1571       mwn  626
84       ypil 1787       pil  630
85   mnxlqgfh 1132       nxl  643
86        gse 1563    gseice  646
87        ygk 1309    ygkqrk  655
88        fgm  933    vzfgmy  663
89      rlupd  977       upd  666
90  mcupdkuiy 1307       upd  666
91        fly  391       fly  669
92      vbkko 1603       kko  678
93      uvrew  465       rew  680
94  hgbhngwvd  901       wvd  690
95   wvdjprmo 1432       wvd  690
96        cgn   22      cgnd  698
97     dngnjv  967       njv  700
98       psqs  841       sqs  720
99        ywv 1180  ptywvlgc  730
100      ypil 1787       ypi  734
101       rwd 1667   srserwd  737
102   jqydasl 1294       jqy  742
103    ckujmc  717       ujm  751
104    dfzxta  662       xta  775
105       bjb 1562   jabjbei  779
106  adwknpll 1242       npl  780
107       kdv 1327   xhkdvqo  789
108       ghj  174      oghj  801
109  lhwfyksz   94       lhw  811
110  nwrrnlhw  929       lhw  811
111     xlhwm 1720       lhw  811
112       ncc 1602 wurhxnccn  814
113    jdslrf 1094       dsl  835
114      ktmw 1738       tmw  844
115 igwsvrzpk   16       gws  856
116       kug  591    pkugls  857
117 befgcpedr  339       fgc  862
118       ojf 1397  ojfpnkla  863
119       gyl 1203  gylxeqzw  872
120     ugcbb 1727       ugc  876
121       arh 1234   karhwhg  878
122       amm  458    ammqdc  883
123  azazryje  636       zaz  900
124    wczazw 1887       zaz  900
125   gkritbe   47      ritb  915
126       vku 1137 yjvkuxued  929
127       rnh 1633 kvyrnhugu  937
128       mzh 1135 xllmwmzhn  940
129       cug 1125       cug  960
130       xgv 1280       xgv  962
131    xusxgv 1436       xgv  962
132       umc  351 lwumcmvoo  980
133       zlb 1900   nkyazlb  991
134  llfkalao 1049       llf 1002
135   sflpbht  991       lpb 1048
136  rairmmcl  442       mmc 1087
137   mmckoln  780       mmc 1087
138  gfxmmcgb 1814       mmc 1087
139       aoj  402   taojlgp 1089
140   mypvzhp  121       ypv 1095
141 moctwaypv  611       ypv 1095
142    rngedn  306       ged 1106
143   djshecy 1408       ecy 1108
144  rairmmcl  442       rmm 1117
145      gzua 1594       zua 1124
146       ytj  416      yytj 1140
147       ubt  300   hubtcfr 1141
148       gqg 1854  ogqgsjqc 1144
149       tfg 1204  xiutfgru 1145
150      avrq  741       avr 1147
151   ytkpvss  440       tkp 1149
152       kug  591   yxsjkug 1176
153       vix 1846    vixsmn 1187
154       qtl  600   qtljkxz 1188
155       lgr  494    dlgrco 1189
156       ryg  864   xlmtryg 1203
157  yskvkxwj 1547       kvk 1205
158     kxhee 1795       xhe 1222
159    hzbcjs 1493       cjs 1224
160       kbi  270  itxlwkbi 1225
161   gdymcam  806      ymca 1232
162       tqr 1419  rxtqrdtl 1236
163       yyz  215      yyzw 1242
164       jyx 1735   mljjyxu 1248
165       aai 1928 umkpaaiwo 1254
166       dsd 1122  dndsdova 1257
167       tor  744      etor 1270
168   vhcyznp 1296       yzn 1278
169       xlc 1947  odxlcjwj 1280
170       mlm 1629  aomlmgtq 1303
171       owm  239    owmugb 1304
172  ynezwaml  507       nez 1308
173       jls  695     jlsve 1325
174       dvm  879      dvmv 1339
175      vsgx  944 dqpihvsgx 1352
176       wfo  768 wfokpjois 1354
177 tltbkinat 1986       nat 1362
178       gyl 1203    gylqte 1363
179       ngg  735 bsnggqbjd 1366
180       fkq  345    jdfkqf 1368
181       ojf 1397  ojfpgfga 1382
182      dqgd 1623 prqbndqgd 1398
183       siu  827 siuypucup 1412
184 yinsoivfd 1895       yin 1414
185       esm 1834  sesmeepz 1417
186       umc  351      umcj 1432
187       wny  866 wnyxamguw 1443
188 ujbhtvnin  399       vni 1444
189       dbq  630     bdbqq 1452
190       ebn 1405   ebngddw 1461
191       zcj  704 rbtjzcjod 1465
192       avn  500   avnspxv 1468
193       vkk  567      hvkk 1477
194       hmm 1441 bgjhmmthz 1483
195    aguakz  614      guak 1487
196 hycvuupgy  141       pgy 1493
197  tizpgymz  280       pgy 1493
198       guk  571  cncxdguk 1502
199       zyw  281   nzywuqs 1504
200       jnz 1558 rxdxsjnzw 1510
201       uuq   92   nxuuqtj 1514
202       qtj  720   nxuuqtj 1514
203       vkk  567 xpbpvkkdc 1518
204       iaa  460     sjiaa 1525
205 txsgmynng 1019       xsg 1526
206    yjvtwc 1107       jvt 1529
207       lnk 1113  hylnknwy 1546
208       szd  635    woszdm 1557
209       osm  253    sosmdp 1567
210       nbd 1067    nbdmmg 1570
211       mmg 1305    nbdmmg 1570
212  wqdsatbd 1536       sat 1585
213    sdlypo 1527       sdl 1596
214   inkynog  288      inky 1600
215 hpwoeclfy 1321       clf 1601
216  wodyqwqf  679       dyq 1603
217       lyw    7 xnalywyuw 1607
218       njm 1825  vjlnjmns 1617
219 njytqhaut  428       qha 1620
220       ilw  389    rilwbk 1647
221       oqo   45 ixoqowkpg 1650
222 odcbcvaun 1386       bcv 1652
223     mastn  434       stn 1662
224 xebhdssit 1091       xeb 1663
225       nmy  782     nmyxj 1671
226   fsqvgdw  673       gdw 1676
227   mwwczhs  482       wcz 1679
228    wczazw 1887       wcz 1679
229   anmryzm  915       ryz 1698
230      rteh  523       rte 1708
231  mlwrguae  817       lwr 1709
232       mbu  819 xpsuqmbuf 1729
233   mmckoln  780       cko 1733
234      lxpg  798       lxp 1734
235       ane  370 vxnanehvk 1746
236       tty 1209 vbttyozui 1752
237 igncdgyjx  332       ign 1753
238    ndignk  621       ign 1753
239       nmy  782   ivnmyba 1780
240 wknrsauuj    2       rsa 1799
241       tgd  165  qtgdidlf 1803
242       iaa  460 yziaazxto 1833
243       xto 1245 yziaazxto 1833
244       zff  755    dpzfft 1857
245       jyx 1735  jwjyxphe 1873
246       ytj  416      eytj 1881
247  lcggwonk 1596       onk 1882
248       zdv  261    zdvxfz 1889
249    xhskcb  417       kcb 1890
250  mrikqkcb  770       kcb 1890
251 psvxqnsap 1352       psv 1898
252   udjswzb  411       jsw 1900
253   rpfjswy 1840       jsw 1900
254  bjaywiso 1677       ayw 1902
255      zfli  130       fli 1906
256      vazx 1215   itvazxw 1918
257       tuw  100 tuwywtbwd 1921
258       vle 1437 ebvleaovm 1937
259  znycsygd 1757       nyc 1944
260  ynezwaml  507       ezw 1952
261   tseezwf 1276       ezw 1952
262 ezwzyfudo 1690       ezw 1952
263    oudiky 1503       dik 1964
264     dikjn 1615       dik 1964
265       oms  106  wpomsudi 1977
266       hhp 1864     hhpkm 1983

基准测试

这个基本 R 选项似乎比 稍慢。

TIC <- function() {
    A <- data_set_A$name
    B <- data_set_B$name_2
    A2B <- sapply(A, function(x) grepl(x, B, fixed = TRUE))
    B2A <- sapply(B, function(x) grepl(x, A, fixed = TRUE))
    idx <- which(t(A2B) | B2A, arr.ind = TRUE)
    cbind(data_set_A[idx[, 1], ], data_set_B[idx[, 2], ])
    # data_set_A[unique(idx[, 1]), ]
}

jpiversen_fixed <- function() {
    data_set_A %>%
        rowwise() %>%
        filter(any(str_detect(name, fixed(data_set_B$name_2))) | any(str_detect(data_set_B$name_2, fixed(name)))) %>%
        ungroup()
}

andrew <- function() {
    nchar_a <- nchar(data_set_A$name)
    nchar_b <- nchar(data_set_B$name_2)

    pattern_a <- str_c(unique(data_set_A$name[nchar_a != max(nchar_a, na.rm = TRUE)]), collapse = "|")
    pattern_b <- str_c(unique(data_set_B$name_2[nchar_b != max(nchar_b, na.rm = TRUE)]), collapse = "|")

    idx_a <- data_set_A$name %in% data_set_B$name_2

    idx_a[!idx_a] <- str_detect(data_set_A$name[!idx_a], pattern_b)

    b_to_check <- data_set_B$name_2[str_detect(data_set_B$name_2, pattern_a)]

    idx_a[!idx_a] <- vapply(data_set_A$name[!idx_a], function(name) {
        any(grepl(name, b_to_check, fixed = TRUE))
    }, logical(1L), USE.NAMES = FALSE)

    data_set_A[idx_a, ]
}

bm <- microbenchmark(
    TIC(),
    jpiversen_fixed(),
    andrew(),
    times = 20
)

表明

> bm
Unit: milliseconds
              expr       min        lq      mean    median        uq       max
             TIC()  423.8410  441.3574  492.6091  478.2596  549.2376  611.3841
 jpiversen_fixed() 1354.8954 1373.9502 1447.8649 1395.6766 1459.7058 1842.2574
          andrew()  329.4821  335.3388  345.8890  341.4758  354.1298  381.6872
 neval
    20
    20
    20