stringdist_semi_join only shows columns from dataframe1
I have two data frames:
df1 <- data.frame(City=c("Munchen_Paris","Munchen_Paris","Barcelona_Milan", "Londen_Dublin","Madrid_Malaga"),
value1=c(11,21,33,2,53))
df2 <- data.frame(City=c("Munnich_Parijs","Barcelona_Munster","Barcelona_Milan","London_Dub","London_Oxford","Pisa_Luik"),
value2=c(22,2,44,54,29,65))
I am trying to merge these data frames with fuzzyjoin.
The result I am looking for is:
City.x value1 City.y value2 string_distance
1 Munchen_Paris 11 Munnich_Parijs 22 5
2 Munchen_Paris 21 Munnich_Parijs 22 5
3 Barcelona_Milan 33 Barcelona_Milan 44 0
4 Londen_Dublin 2 London_Dub 54 4
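(For each row in df1, among the cities in df2 that match with string_distance < 9, I want a single row in the new table containing all columns from df1 and df2 for the match with the lowest string_distance.)
When I do this: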
df3 <- stringdist_semi_join(df1, df2, by = "City", max_dist = 9, distance_col = "string_distance")
I only get these columns:
> df3
City value1
1 Munchen_Paris 11
2 Munchen_Paris 21
3 Barcelona_Milan 33
4 Londen_Dublin 2
If I do a full join, I get:
> df3 <- stringdist_full_join(df1, df2, by = "City", max_dist = 9, distance_col = "string_distance")
> df3
City.x value1 City.y value2 string_distance
1 Munchen_Paris 11 Munnich_Parijs 22 5
2 Munchen_Paris 21 Munnich_Parijs 22 5
3 Barcelona_Milan 33 Barcelona_Munster 2 6
4 Barcelona_Milan 33 Barcelona_Milan 44 0
5 Londen_Dublin 2 London_Dub 54 4
6 Londen_Dublin 2 London_Oxford 29 7
7 Madrid_Malaga 53 <NA> NA NA
8 <NA> NA Pisa_Luik 65 NA
I could remove the rows containing NA and group_by City.x, but that way I lose one of the first two rows.
If I do an inner join, I get:
> df3 <- stringdist_inner_join(df1, df2, by = "City", max_dist = 9, distance_col = "string_distance")
> df3
City.x value1 City.y value2 string_distance
1 Munchen_Paris 11 Munnich_Parijs 22 5
2 Munchen_Paris 21 Munnich_Parijs 22 5
3 Barcelona_Milan 33 Barcelona_Munster 2 6
4 Barcelona_Milan 33 Barcelona_Milan 44 0
5 Londen_Dublin 2 London_Dub 54 4
6 Londen_Dublin 2 London_Oxford 29 7
Isn't it strange that stringdist_semi_join does not show the columns of df2?
Is there another way to get the result I am looking for in the first table above?
Many thanks!
What a semi join does (from the dplyr documentation):
return all rows from x where there are matching values in y, keeping just columns from x. A semi join differs from an inner join because an inner join will return one row of x for each matching row of y, where a semi join will never duplicate rows of x.
So the behavior you are seeing is expected.
You are looking for an inner join:
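To make the difference concrete, here is a small sketch with plain dplyr joins on exact keys. The data frames `x` and `y` and their columns are made up for illustration and are not part of the question.

```r
library(dplyr)

# Hypothetical toy data: key "a" appears twice in y, key "c" has no match.
x <- data.frame(key = c("a", "b", "c"), vx = 1:3)
y <- data.frame(key = c("a", "a", "b"), vy = c(10, 20, 30))

# Semi join: filters x to the rows that have a match, keeps only x's columns,
# and never duplicates a row of x.
semi <- semi_join(x, y, by = "key")

# Inner join: one output row per matching pair, with columns from both sides.
inner <- inner_join(x, y, by = "key")

names(semi)
names(inner)
```

The semi join acts as a filter on `x`, which is why no `df2` columns show up in the fuzzy version either.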
return all rows from x where there are matching values in y, and all columns from x and y. If there are multiple matches between x and y, all combination of the matches are returned.
In fuzzyjoin this is called stringdist_inner_join:
df3 <- stringdist_inner_join(df1, df2, by = "City", max_dist = 9, distance_col = "string_distance")
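The inner join still returns every match within max_dist, so one further step is needed to keep only the closest match per row of df1. The following is one possible sketch, assuming dplyr 1.0.0 or later for `slice_min`. The `row_id` column is a helper introduced here so that the two identical "Munchen_Paris" rows stay separate, and `with_ties = FALSE` is an assumption about how equal distances should be handled.

```r
library(dplyr)
library(fuzzyjoin)

df1 <- data.frame(City = c("Munchen_Paris", "Munchen_Paris", "Barcelona_Milan",
                           "Londen_Dublin", "Madrid_Malaga"),
                  value1 = c(11, 21, 33, 2, 53))
df2 <- data.frame(City = c("Munnich_Parijs", "Barcelona_Munster", "Barcelona_Milan",
                           "London_Dub", "London_Oxford", "Pisa_Luik"),
                  value2 = c(22, 2, 44, 54, 29, 65))

best <- df1 %>%
  # Helper column (not in the original data): identifies each row of df1,
  # because City alone is not unique.
  mutate(row_id = row_number()) %>%
  stringdist_inner_join(df2, by = "City", max_dist = 9,
                        distance_col = "string_distance") %>%
  group_by(row_id) %>%
  # Keep the single closest df2 match for each df1 row.
  slice_min(string_distance, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  select(-row_id)

best
```

Grouping by a row identifier rather than by City.x avoids collapsing the two "Munchen_Paris" rows into one. Rows of df1 with no match within max_dist, such as "Madrid_Malaga", are dropped by the inner join.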