stringdist_semi_join 仅显示 dataframe1 中的列

stringdist_semi_join only shows columns from dataframe1

我有两个数据框:

df1 <- data.frame(City=c("Munchen_Paris","Munchen_Paris","Barcelona_Milan", "Londen_Dublin","Madrid_Malaga"), 
                  value1=c(11,21,33,2,53))

df2 <- data.frame(City=c("Munnich_Parijs","Barcelona_Munster","Barcelona_Milan","London_Dub","London_Oxford","Pisa_Luik"), 
                  value2=c(22,2,44,54,29,65))

我尝试用 fuzzyjoin 合并这些数据帧。

我要找的结果是:

           City.x  value1   City.y             value2  string_distance
1   Munchen_Paris      11   Munnich_Parijs     22      5
2   Munchen_Paris      21   Munnich_Parijs     22      5
3 Barcelona_Milan      33   Barcelona_Milan    44      0
4   Londen_Dublin       2   London_Dub         54      4

(对于 df1 中的每一行,在 df2 中匹配 string_distance < 9 的城市,我想要新的 table 中的一行包含 df1 和 df2 中最低的所有列string_distance) 当我这样做时:

df3 <- stringdist_semi_join(df1, df2, by = "City", max_dist = 9, distance_col = "string_distance")

我只收到这些专栏:

> df3
             City value1
1   Munchen_Paris     11
2   Munchen_Paris     21
3 Barcelona_Milan     33
4   Londen_Dublin      2

如果我进行完全连接,我会收到:

> df3 <- stringdist_full_join(df1, df2, by = "City", max_dist = 9, distance_col = "string_distance")

> df3
           City.x value1            City.y value2 string_distance
1   Munchen_Paris     11    Munnich_Parijs     22               5
2   Munchen_Paris     21    Munnich_Parijs     22               5
3 Barcelona_Milan     33 Barcelona_Munster      2               6
4 Barcelona_Milan     33   Barcelona_Milan     44               0
5   Londen_Dublin      2        London_Dub     54               4
6   Londen_Dublin      2     London_Oxford     29               7
7   Madrid_Malaga     53              <NA>     NA              NA
8            <NA>     NA         Pisa_Luik     65              NA

我可以删除包含 NA 和 group_by City.x 的行,尽管这样我丢失了前两行中的一个。

如果我这样做 inner_join 我会收到:

    df3 <- stringdist_inner_join(df1, df2, by = "City", max_dist = 9, distance_col = "string_distance")

df3

> df3
           City.x value1            City.y value2 string_distance
1   Munchen_Paris     11    Munnich_Parijs     22               5
2   Munchen_Paris     21    Munnich_Parijs     22               5
3 Barcelona_Milan     33 Barcelona_Munster      2               6
4 Barcelona_Milan     33   Barcelona_Milan     44               0
5   Londen_Dublin      2        London_Dub     54               4
6   Londen_Dublin      2     London_Oxford     29               7

奇怪stringdist_semi_join没有显示df2的列吗? 有没有其他方法可以达到我在上面第一个 table 中寻找的结果?

非常感谢!

半连接的作用 (from the dplyr documentation):

return all rows from x where there are matching values in y, keeping just columns from x. A semi join differs from an inner join because an inner join will return one row of x for each matching row of y, where a semi join will never duplicate rows of x.

所以您看到的行为是正常的。

您正在寻找内部联接:

return all rows from x where there are matching values in y, and all columns from x and y. If there are multiple matches between x and y, all combination of the matches are returned.

fuzzyjoin 中称为 stringdist_inner_join:

df3 <- stringdist_inner_join(df1, df2, by = "City", max_dist = 9, distance_col = "string_distance")