Merging two dataframes by stringmatch with dplyr and stringdist
I'm trying to do a dplyr left join on two data frames based on text that is very similar but not identical.
DF1:
title | records
Bob's show, part 1 | 42
Time for dinner | 77
Horsecrap | 121
DF2:
showname | counts
Bob's show part 1 | 772
Dinner time | 89
No way Jose | 123
I've used the stringdist package to get the closest-match indices as a vector:
titlematch <- amatch(df1$title,df2$showname)
The vector looks like... well, a vector of integers:
titlematch
1
2
NA
Normally, if I had exact matches, I would do:
blended <- left_join(df1, df2, by = c("title" = "showname"))
How do I use the vector as a record selector for the left join, so that the end result is:
title | records | showname | counts
Bob's show, part 1 | 42 | Bob's show part 1 | 772
Time for dinner | 77 | Dinner time | 89
The third row is excluded as a non-match, since the vector has no likely match for it (NA).
Here's one way:
library(stringdist)
library(tidyverse)
df1 %>%
  as_tibble() %>%
  mutate(temp = amatch(title, df2$showname, maxDist = 10)) %>%
  bind_cols(df2[.$temp, ]) %>%
  select(-temp)
# A tibble: 3 x 4
title records showname counts
<chr> <int> <chr> <int>
1 Bob's show, part 1 42 Bob's show part 1 772
2 Time for dinner 77 Dinner time 89
3 Horsecrap 121 Dinner time 89
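I couldn't reproduce your numeric match vector: amatch(df1$title, df2$showname) gives me [1] NA NA NA, so I set maxDist to 10, since the default appears to be 0.1. Note that a threshold this loose also matches Horsecrap to Dinner time, as row 3 of the output shows.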
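To pick a sensible maxDist, it can help to look at the distances themselves. The following is a minimal sketch, assuming the default "osa" method; the names titles, shownames, d and strict are made up for illustration, and the threshold of 1 is only an example value.

```r
library(stringdist)

titles    <- c("Bob's show, part 1", "Time for dinner", "Horsecrap")
shownames <- c("Bob's show part 1", "Dinner time", "No way Jose")

# Pairwise OSA distances: one row per title, one column per showname
d <- stringdistmatrix(titles, shownames, method = "osa")
dimnames(d) <- list(titles, shownames)
print(d)

# With a tight threshold (an assumed value), only the near-identical
# title finds a match; the other two come back as NA
strict <- amatch(titles, shownames, method = "osa", maxDist = 1)
print(strict)
```

The first pair differs only by the comma, so its distance is 1. The word-swapped pair ("Time for dinner" / "Dinner time") is much further apart under an edit distance, which is why a threshold loose enough to catch it can also pull in unrelated titles.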
Finally, you can always add %>% filter(!is.na(showname)) to drop any rows that have no match.
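Putting the pieces together, here is a rough sketch of the match-then-filter idea. It pulls the df2 columns in by index inside mutate() rather than with bind_cols(), which is a swapped-in variation and not the code above; idx and blended are illustrative names, and maxDist = 1 is an assumed threshold that you would tune for your own data.

```r
library(dplyr)
library(stringdist)

df1 <- data.frame(
  title   = c("Bob's show, part 1", "Time for dinner", "Horsecrap"),
  records = c(42L, 77L, 121L),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  showname = c("Bob's show part 1", "Dinner time", "No way Jose"),
  counts   = c(772L, 89L, 123L),
  stringsAsFactors = FALSE
)

blended <- df1 %>%
  mutate(
    idx      = amatch(title, df2$showname, maxDist = 1),  # NA when no match
    showname = df2$showname[idx],                         # look up by index
    counts   = df2$counts[idx]
  ) %>%
  filter(!is.na(idx)) %>%   # drop rows that found no match
  select(-idx)

print(blended)
```

With this tight threshold only the "Bob's show" row survives; the word-swapped "Time for dinner" row needs a looser threshold, at the cost of possible false matches.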
Data
df1 <- structure(list(title = c("Bob's show, part 1", "Time for dinner",
"Horsecrap"), records = c(42L, 77L, 121L)), .Names = c("title",
"records"), row.names = c(NA, -3L), class = "data.frame")
df2 <- structure(list(showname = c("Bob's show part 1", "Dinner time",
"No way Jose"), counts = c(772L, 89L, 123L)), .Names = c("showname",
"counts"), row.names = c(NA, -3L), class = "data.frame")
Have you looked at fuzzyjoin?
I had never heard of fuzzyjoin before, but I tried it and love it. stringdist_left_join is exactly what I needed.
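For reference, a minimal sketch of what that might look like with the sample data. The max_dist value of 1 and the "osa" method are assumptions chosen for illustration, not settings taken from the thread.

```r
library(fuzzyjoin)

df1 <- data.frame(
  title   = c("Bob's show, part 1", "Time for dinner", "Horsecrap"),
  records = c(42L, 77L, 121L),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  showname = c("Bob's show part 1", "Dinner time", "No way Jose"),
  counts   = c(772L, 89L, 123L),
  stringsAsFactors = FALSE
)

# Left join on approximate string equality: every df1 row is kept,
# and rows without a close enough showname get NA in the df2 columns
blended <- stringdist_left_join(
  df1, df2,
  by       = c("title" = "showname"),
  method   = "osa",
  max_dist = 1   # assumed threshold
)

print(blended)
```

Swapping in stringdist_inner_join would drop the unmatched rows instead of keeping them with NA values.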