How to perform a fuzzy join with fuzzyjoin::difference_* in R
I'm working with two different datasets that I'd like to merge based on a threshold. Say the two data frames look like this:
library(dplyr)
library(fuzzyjoin)
library(lubridate)

# tibble() replaces the deprecated data_frame(); dplyr re-exports it
df1 = tibble(Item=1:5,
             DateTime=c("2015-01-01 11:12:14", "2015-01-02 09:15:23",
                        "2015-01-02 15:46:11", "2015-04-19 22:11:33",
                        "2015-06-10 07:00:00"),
             Count=c(1, 6, 11, 15, 9),
             Name="Sterling",
             Friend=c("Pam", "Cyril", "Cheryl", "Mallory", "Lana"))
df1$DateTime = ymd_hms(df1$DateTime)

df2 = tibble(Item=21:25,
             DateTime=c("2015-01-01 11:12:15", "2015-01-02 19:15:23",
                        "2015-01-02 15:46:11", "2015-05-19 22:11:33",
                        "2015-06-10 07:00:02"),
             Count=c(3, 7, 11, 15, 8),
             Name="Sterling",
             Friend=c("Pam", "Kreger", "Woodhouse", "Gillete", "Lana"))
df2$DateTime = ymd_hms(df2$DateTime)
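What I'd like now is to left join df2 and df1 on fuzzy matches of DateTime and Count, each within two of the other table's value, and only where all other values except Item are identical. I thought I could get there with: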
df1 %>%
  difference_left_join(df2, by=c("DateTime", "Count"), max_dist=2)
But this gives me the following output:
# A tibble: 8 × 10
  Item.x DateTime.x          Count.x Name.x   Friend.x Item.y DateTime.y          Count.y Name.y   Friend.y
   <int> <dttm>                <dbl> <chr>    <chr>     <int> <dttm>                <dbl> <chr>    <chr>
1      1 2015-01-01 11:12:14       1 Sterling Pam          21 2015-01-01 11:12:15       3 Sterling Pam
2      1 2015-01-01 11:12:14       1 Sterling Pam          21 2015-01-01 11:12:15       3 Sterling Pam
3      2 2015-01-02 09:15:23       6 Sterling Cyril        NA <NA>                     NA <NA>     <NA>
4      3 2015-01-02 15:46:11      11 Sterling Cheryl       23 2015-01-02 15:46:11      11 Sterling Woodhouse
5      3 2015-01-02 15:46:11      11 Sterling Cheryl       23 2015-01-02 15:46:11      11 Sterling Woodhouse
6      4 2015-04-19 22:11:33      15 Sterling Mallory      NA <NA>                     NA <NA>     <NA>
7      5 2015-06-10 07:00:00       9 Sterling Lana         25 2015-06-10 07:00:02       8 Sterling Lana
8      5 2015-06-10 07:00:00       9 Sterling Lana         25 2015-06-10 07:00:02       8 Sterling Lana
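This is close, except that row 3 should not be merged, because the names differ (I'd expect row 2 to merge given the thresholds, even though I don't want it to).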
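How can I get the following data frame? Note that the second and third rows of df2 are not merged even though DateTime and Count satisfy the thresholds. That is because the other columns (apart from Item) are not identical.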
desired_output
#   Item            DateTime Count     Name  Friend
# 1    3 2015-01-02 15:46:11    11 Sterling  Cheryl
# 2    2 2015-01-02 09:15:23     6 Sterling   Cyril
# 3    5 2015-06-10 07:00:00     9 Sterling    Lana
# 4   25 2015-06-10 07:00:02     8 Sterling    Lana
# 5    4 2015-04-19 22:11:33    15 Sterling Mallory
# 6    1 2015-01-01 11:12:14     1 Sterling     Pam
# 7   21 2015-01-01 11:12:15     3 Sterling     Pam
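OK, so the message you got appears because the fuzzy match can't be computed on non-numeric columns.
The fix is to convert the column to numeric. Since your caliper (the matching threshold) is in seconds, I convert the timestamps to seconds and store them as plain numbers: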
library(dplyr)
library(fuzzyjoin)
library(lubridate)

# tibble() replaces the deprecated data_frame(); dplyr re-exports it
df1 = tibble(Item=1:5,
             DateTime=c("2015-01-01 11:12:14", "2015-01-02 09:15:23",
                        "2015-01-02 15:46:11", "2015-04-19 22:11:33",
                        "2015-06-10 07:00:00"),
             Count=c(1, 6, 11, 15, 9),
             Name="Sterling",
             Friend=c("Pam", "Cyril", "Cheryl", "Mallory", "Lana"))
# Numeric seconds since the Unix epoch, so that max_dist is measured in seconds
df1$DateTime1 = as.numeric(ymd_hms(df1$DateTime))

df2 = tibble(Item=21:25,
             DateTime=c("2015-01-01 11:12:15", "2015-01-02 19:25:56",
                        "2015-01-02 15:46:11", "2015-05-19 22:11:33",
                        "2015-06-10 07:00:02"),
             Count=c(3, 6, 11, 15, 8),
             Name="Sterling",
             Friend=c("Pam", "Kreger", "Woodhouse", "Gillete", "Lana"))
df2$DateTime1 = as.numeric(ymd_hms(df2$DateTime))
df1 %>%
  difference_left_join(y=df2, by=c("DateTime1", "Count"), max_dist=2)
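To illustrate why the numeric conversion matters, here is a minimal base R sketch of the check I understand the difference join to apply to a single pair of rows. It uses the "Lana" rows from the data above and assumes a pair matches only when every column listed in by differs by at most max_dist. The variable names (t1, t2, gap_time, gap_count, is_match) are made up for this illustration.

```r
# Base R only; as.POSIXct() stands in for lubridate::ymd_hms() here.
t1 <- as.POSIXct("2015-06-10 07:00:00", tz = "UTC")  # df1, Item 5
t2 <- as.POSIXct("2015-06-10 07:00:02", tz = "UTC")  # df2, Item 25
count1 <- 9
count2 <- 8
max_dist <- 2

# as.numeric() on a POSIXct value gives seconds since the Unix epoch,
# so the absolute difference is a plain number of seconds.
gap_time <- abs(as.numeric(t1) - as.numeric(t2))
gap_count <- abs(count1 - count2)

# Assumption: both columns must be within max_dist for the pair to match.
is_match <- gap_time <= max_dist && gap_count <= max_dist

print(c(gap_time = gap_time, gap_count = gap_count))
print(is_match)
```

Here the timestamps are 2 seconds apart and the counts differ by 1, so the pair falls within the threshold on both columns.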
Per our discussion in the comments, a simple tweak that restricts the join to cases where the other character columns match is:
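# Note: df2$Friend == df1$Friend compares the two tables row by row (by position),
# so this assumes both have the same number of rows in corresponding order.
df1[df2$Friend == df1$Friend, ] %>%
  difference_left_join(y=df2[df2$Friend == df1$Friend, ], by=c("DateTime1", "Count"), max_dist=2)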
This example only handles Friend, but you can of course use & to cover multiple columns.
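As a rough sketch of that multi-column variant, the snippet below combines the Name and Friend comparisons with & into a single logical mask. It uses base R data frames that contain only the relevant columns from the example data. The names keep, df1_sub and df2_sub are hypothetical, and the comparison is still positional, so it assumes the two tables have the same number of rows in corresponding order.

```r
# Reduced copies of the example data (only the columns needed for the mask).
df1 <- data.frame(Item = 1:5,
                  Name = "Sterling",
                  Friend = c("Pam", "Cyril", "Cheryl", "Mallory", "Lana"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(Item = 21:25,
                  Name = "Sterling",
                  Friend = c("Pam", "Kreger", "Woodhouse", "Gillete", "Lana"),
                  stringsAsFactors = FALSE)

# TRUE only where both character columns agree in the same row position.
keep <- df1$Name == df2$Name & df1$Friend == df2$Friend

# Subset both tables with the same mask before the fuzzy join.
df1_sub <- df1[keep, ]
df2_sub <- df2[keep, ]

print(df1_sub)
print(df2_sub)
```

The two subsets can then be passed to difference_left_join() with by=c("DateTime1", "Count") and max_dist=2, exactly as in the call above, once the DateTime1 and Count columns are included.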