Comparing each row from one data frame with each row of another one in the tidyverse
I need to compare each row of one data frame with each row of another data frame:
id first_name last_name account_nr amount currency comment
1 wW3A4QgpQQd Lynnett Labadini ES46 2569 1625 6669 5490 4624 9655.56 JPY Z617
2 LsoPIXEMOo5 Velvet Ritelli FR60 4478 1591 96PB SIMI FSTO L13 6992.36 PHP L841
3 L2wBds77Pw8 orv matfin LB61 6941 CQYE ONER G5T0 KNDU JU5H 6184.38 CAD o705
4 ME4O9MKlOzO ring hecks BG28 JYPB 4068 09NB FQ7I 6C 4203.54 IDR Y548
5 d83N7Viwq8k judd Riddick IL36 2200 2898 6944 4508 084 3619.43 IDR O762
6 1l96680epEy Edouard Kapovski IS73 1064 6186 1231 6178 3743 49 5291.76 BRL T397
7 7JwvD23oMzC Jake Rabinovich KZ80 759G VOHS JHBY L5TY 6994.26 NGN Y784
8 ZOcg2uprlN6 vere gravener SE39 1416 1830 7878 5026 6805 5281.18 UAH Z890
9 AUrx3nYR2Ks Bob Kelso VS41 5146 7748 1278 5362 4324.12 USD W312
10 VrDS+DqRG4S1 Mitch Mitchell AT65 6306 7334 7478 1908 4221.59 EUR T352
and the other one:
id first_name last_name amount currency comment recipient
1 xGZx1tNE4oa Lynnett Labadini 9655.56 JPY Z617 72
2 nV7NtxiguPQ Velvet Ritelli 6992.36 PHP L841 175
3 Rto0EHOR17k Orv Matfin 6184.38 CAD O705 412
4 2VDMHTJnxcw Ring Hecks 4203.54 IDR Y548 63
5 VQI7I0EZf1q Judd Riddick 3619.43 IDR O163 39
6 w835JEfmJvZ Edouard Avramovic 5291.76 BRL T397 240
7 of2FZZXFKY8 Ferdy Petracchi 6994.26 NGN Y784 102
8 XgUZFhKowB1 Vere Gravener 5281.18 IDR U024 111
9 iGO9advyXP3 Temp McKeevers 7364.49 TND R404 327
10 5BCiYQVhfxM Arnie Ashdown 4221.59 ZAR N988 262
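I would like to do this with the tidyverse, but another approach is fine too. I do not want to use loops. There are no matches between the IDs. The task is to do some kind of fuzzy join on the first_name, last_name, amount, currency, and comment columns. One approach I have seen is to repeat each row of the first data frame nrow times, pair it with each row of the other data frame, and use map, but I think that is very memory-inefficient.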
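Using fuzzyjoin
See my solution below. It essentially does pair every row on the left with every row on the right, because I set a high max_dist (10); you can lower it if you do not want poor matches. It then uses group_by and top_n to pick the best match for each first_name and last_name in the first data frame.
I added your "mismatch" and "label" conditions (see the first two columns). You can tune the options of the matching function (right now it uses one specific stringdist method, Levenshtein, to compare string distances on the five columns you specified).
Also, Bob Kelso appears twice because his best match is a tie between two (poor) matches, so the algorithm cannot choose between equally bad candidates.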
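To make the distance concrete, here is a minimal sketch of the per-field Levenshtein comparison. It uses base R's `adist()` as a stand-in for the stringdist "lv" method, and the helper name `lev` is made up for illustration. It is not how fuzzyjoin is implemented internally, only an approximation of what `ignore_case = TRUE` with `method = "lv"` amounts to for a single pair of values.

```r
# Hypothetical helper: Levenshtein distance between two values,
# optionally ignoring case (base R only, no extra packages).
lev <- function(a, b, ignore_case = TRUE) {
  a <- as.character(a)
  b <- as.character(b)
  if (ignore_case) {
    a <- tolower(a)
    b <- tolower(b)
  }
  as.vector(adist(a, b))
}

# Fields of the fifth row of each table in the question
first_name_dist <- lev("judd", "Judd")   # differs only by case
comment_dist    <- lev("O762", "O163")   # two substituted characters

first_name_dist
comment_dist
```

With case ignored, the first names are identical and only the comment differs, which is consistent with the "high" label (one mismatching column) that row receives in the output further down.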
library(tidyverse); library(fuzzyjoin)
# Load data
df1 <- tibble::tribble(
~id, ~first_name, ~last_name, ~account_nr, ~amount, ~currency, ~comment,
"wW3A4QgpQQd", "Lynnett", "Labadini", "ES46 2569 1625 6669 5490 4624", 9655.56, "JPY", "Z617",
"LsoPIXEMOo5", "Velvet", "Ritelli", "FR60 4478 1591 96PB SIMI FSTO L13", 6992.36, "PHP", "L841",
"L2wBds77Pw8", "orv", "matfin", "LB61 6941 CQYE ONER G5T0 KNDU JU5H", 6184.38, "CAD", "o705",
"ME4O9MKlOzO", "ring", "hecks", "BG28 JYPB 4068 09NB FQ7I 6C", 4203.54, "IDR", "Y548",
"d83N7Viwq8k", "judd", "Riddick", "IL36 2200 2898 6944 4508 084", 3619.43, "IDR", "O762",
"1l96680epEy", "Edouard", "Kapovski", "IS73 1064 6186 1231 6178 3743 49", 5291.76, "BRL", "T397",
"7JwvD23oMzC", "Jake", "Rabinovich", "KZ80 759G VOHS JHBY L5TY", 6994.26, "NGN", "Y784",
"ZOcg2uprlN6", "vere", "gravener", "SE39 1416 1830 7878 5026 6805", 5281.18, "UAH", "Z890",
"AUrx3nYR2Ks", "Bob", "Kelso", "VS41 5146 7748 1278 5362", 4324.12, "USD", "W312",
"VrDS+DqRG4S1", "Mitch", "Mitchell", "AT65 6306 7334 7478 1908", 4221.59, "EUR", "T352"
)
df2 <- tibble::tribble(
~id, ~first_name, ~last_name, ~amount, ~currency, ~comment, ~recipient,
"xGZx1tNE4oa", "Lynnett", "Labadini", 9655.56, "JPY", "Z617", 72,
"nV7NtxiguPQ", "Velvet", "Ritelli", 6992.36, "PHP", "L841", 175,
"Rto0EHOR17k", "Orv", "Matfin", 6184.38, "CAD", "O705", 412,
"2VDMHTJnxcw", "Ring", "Hecks", 4203.54, "IDR", "Y548", 63,
"VQI7I0EZf1q", "Judd", "Riddick", 3619.43, "IDR", "O163", 39,
"w835JEfmJvZ", "Edouard", "Avramovic", 5291.76, "BRL", "T397", 240,
"of2FZZXFKY8", "Ferdy", "Petracchi", 6994.26, "NGN", "Y784", 102,
"XgUZFhKowB1", "Vere", "Gravener", 5281.18, "IDR", "U024", 111,
"iGO9advyXP3", "Temp", "McKeevers", 7364.49, "TND", "R404", 327,
"5BCiYQVhfxM", "Arnie", "Ashdown", 4221.59, "ZAR", "N988", 262
)
# Solution using fuzzyjoin
stringdist_left_join(df1, df2,
                     by = c("first_name", "last_name", "amount", "currency", "comment"),
                     max_dist = 10, ignore_case = TRUE, method = "lv",
                     distance_col = "dist") %>%
  # Sum the per-column distances into one overall score
  mutate(total.dist = first_name.dist + last_name.dist + amount.dist +
           currency.dist + comment.dist) %>%
  # Keep the closest candidate(s) for each person in df1
  group_by(first_name.x, last_name.x) %>%
  top_n(-1, total.dist) %>%
  # Count how many of the five columns differ and label the match quality
  mutate(mismatch = (first_name.dist > 0) + (last_name.dist > 0) + (amount.dist > 0) +
           (currency.dist > 0) + (comment.dist > 0),
         label = case_when(mismatch == 0 ~ "match",
                           mismatch == 1 ~ "high",
                           mismatch == 2 ~ "proposed",
                           mismatch > 2 ~ "none",
                           TRUE ~ "")) %>%
  select(label, mismatch, total.dist, everything())
#> # A tibble: 11 x 22
#> # Groups: first_name.x, last_name.x [10]
#> label mismatch total.dist id.x first_name.x last_name.x account_nr
#> <chr> <int> <dbl> <chr> <chr> <chr> <chr>
#> 1 match 0 0 wW3A~ Lynnett Labadini ES46 2569~
#> 2 match 0 0 LsoP~ Velvet Ritelli FR60 4478~
#> 3 match 0 0 L2wB~ orv matfin LB61 6941~
#> 4 match 0 0 ME4O~ ring hecks BG28 JYPB~
#> 5 high 1 2 d83N~ judd Riddick IL36 2200~
#> 6 high 1 7 1l96~ Edouard Kapovski IS73 1064~
#> 7 prop~ 2 14 7Jwv~ Jake Rabinovich KZ80 759G~
#> 8 prop~ 2 7 ZOcg~ vere gravener SE39 1416~
#> 9 none 5 20 AUrx~ Bob Kelso VS41 5146~
#> 10 none 5 20 AUrx~ Bob Kelso VS41 5146~
#> 11 none 4 19 VrDS~ Mitch Mitchell AT65 6306~
#> # ... with 15 more variables: amount.x <dbl>, currency.x <chr>,
#> # comment.x <chr>, id.y <chr>, first_name.y <chr>, last_name.y <chr>,
#> # amount.y <dbl>, currency.y <chr>, comment.y <chr>, recipient <dbl>,
#> # amount.dist <dbl>, comment.dist <dbl>, currency.dist <dbl>,
#> # first_name.dist <dbl>, last_name.dist <dbl>
Created on 2019-03-17 by the reprex package (v0.2.1)
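If the duplicated Bob Kelso row is unwanted, one possible variation (not part of the answer above) is to break ties explicitly. The sketch below assumes dplyr 1.0.0 or later, where `slice_min()` accepts `with_ties = FALSE`. The `candidates` table and its `id.y` values are invented purely for illustration.

```r
library(dplyr)

# Hypothetical candidate matches: "Bob Kelso" has two equally distant candidates
candidates <- tibble::tribble(
  ~first_name.x, ~last_name.x, ~id.y, ~total.dist,
  "Bob",         "Kelso",      "a",   20,
  "Bob",         "Kelso",      "b",   20,
  "judd",        "Riddick",    "c",   2,
  "judd",        "Riddick",    "d",   9
)

# Keep exactly one lowest-distance row per person, even when distances tie
best <- candidates %>%
  group_by(first_name.x, last_name.x) %>%
  slice_min(total.dist, n = 1, with_ties = FALSE) %>%
  ungroup()

best
```

Dropping a tied row this way is an arbitrary choice, so it only makes sense when any one of the equally poor candidates is acceptable. Otherwise, lowering max_dist so that poor candidates never enter the join is the cleaner option.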