Simultaneous fuzzy and non-fuzzy join
Suppose I have this data frame:
# Set random seed
set.seed(33550336)
# Number of IDs
n <- 5
# Create data frames
df <- data.frame(ID = rep(1:n, each = 10),
                 loc = seq(10, 100, by = 10))
# ID loc
# 1 1 10
# 2 1 20
# 3 1 30
# 4 1 40
# 5 1 50
# 6 1 60
# 7 1 70
# 8 1 80
# 9 1 90
# 10 1 100
# 11 2 10
# 12 2 20
# 13 2 30
# 14 2 40
# 15 2 50
# 16 2 60
# 17 2 70
# 18 2 80
# 19 2 90
# 20 2 100
# 21 3 10
# 22 3 20
# 23 3 30
# 24 3 40
# 25 3 50
# 26 3 60
# 27 3 70
# 28 3 80
# 29 3 90
# 30 3 100
# 31 4 10
# 32 4 20
# 33 4 30
# 34 4 40
# 35 4 50
# 36 4 60
# 37 4 70
# 38 4 80
# 39 4 90
# 40 4 100
# 41 5 10
# 42 5 20
# 43 5 30
# 44 5 40
# 45 5 50
# 46 5 60
# 47 5 70
# 48 5 80
# 49 5 90
# 50 5 100
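Now, I want to join a second data frame:
# Note: loc (5 * n values) and value (n values) are shorter than ID and are
# recycled by data.frame() to fill all 10 * n rows
df_alt <- data.frame(ID = rep(1:n, each = 10),
                     loc = sample(1:100, 5 * n, replace = TRUE),
                     value = runif(n))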
# ID loc value
# 1 1 87 0.3202490
# 2 1 36 0.4724253
# 3 1 53 0.4750352
# 4 1 7 0.8744985
# 5 1 38 0.2016645
# 6 1 92 0.3202490
# 7 1 74 0.4724253
# 8 1 72 0.4750352
# 9 1 73 0.8744985
# 10 1 95 0.2016645
# 11 2 61 0.3202490
# 12 2 5 0.4724253
# 13 2 87 0.4750352
# 14 2 11 0.8744985
# 15 2 10 0.2016645
# 16 2 25 0.3202490
# 17 2 60 0.4724253
# 18 2 62 0.4750352
# 19 2 52 0.8744985
# 20 2 31 0.2016645
# 21 3 3 0.3202490
# 22 3 43 0.4724253
# 23 3 45 0.4750352
# 24 3 91 0.8744985
# 25 3 51 0.2016645
# 26 3 87 0.3202490
# 27 3 36 0.4724253
# 28 3 53 0.4750352
# 29 3 7 0.8744985
# 30 3 38 0.2016645
# 31 4 92 0.3202490
# 32 4 74 0.4724253
# 33 4 72 0.4750352
# 34 4 73 0.8744985
# 35 4 95 0.2016645
# 36 4 61 0.3202490
# 37 4 5 0.4724253
# 38 4 87 0.4750352
# 39 4 11 0.8744985
# 40 4 10 0.2016645
# 41 5 25 0.3202490
# 42 5 60 0.4724253
# 43 5 62 0.4750352
# 44 5 52 0.8744985
# 45 5 31 0.2016645
# 46 5 3 0.3202490
# 47 5 43 0.4724253
# 48 5 45 0.4750352
# 49 5 91 0.8744985
# 50 5 51 0.2016645
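I want an exact match on ID and the closest match on loc. I looked at the fuzzyjoin package, but as far as I could tell it does not allow different degrees of fuzziness for different columns. That is, I could not specify an exact match for ID and a fuzzy match for loc. So, as a workaround, I left-join by ID, compute the distance between loc.x and loc.y (i.e., the loc values from df and df_alt), group by ID and loc.x, sort by that distance, and keep the first row (i.e., the shortest distance):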
library(dplyr)
# Bind and find nearest
df_res <- df %>%
  left_join(df_alt, by = "ID") %>%
  mutate(delta = abs(loc.x - loc.y)) %>%
  group_by(ID, loc.x) %>%
  arrange(delta) %>%
  filter(row_number() == 1) %>%
  ungroup() %>%
  arrange(ID, loc.x)
# # A tibble: 50 x 5
# ID loc.x loc.y value delta
# <int> <dbl> <int> <dbl> <dbl>
# 1 1 10 7 0.874 3
# 2 1 20 7 0.874 13
# 3 1 30 36 0.472 6
# 4 1 40 38 0.202 2
# 5 1 50 53 0.475 3
# 6 1 60 53 0.475 7
# 7 1 70 72 0.475 2
# 8 1 80 74 0.472 6
# 9 1 90 92 0.320 2
# 10 1 100 95 0.202 5
# 11 2 10 10 0.202 0
# 12 2 20 25 0.320 5
# 13 2 30 31 0.202 1
# 14 2 40 31 0.202 9
# 15 2 50 52 0.874 2
# 16 2 60 60 0.472 0
# 17 2 70 62 0.475 8
# 18 2 80 87 0.475 7
# 19 2 90 87 0.475 3
# 20 2 100 87 0.475 13
# 21 3 10 7 0.874 3
# 22 3 20 7 0.874 13
# 23 3 30 36 0.472 6
# 24 3 40 38 0.202 2
# 25 3 50 51 0.202 1
# 26 3 60 53 0.475 7
# 27 3 70 87 0.320 17
# 28 3 80 87 0.320 7
# 29 3 90 91 0.874 1
# 30 3 100 91 0.874 9
# 31 4 10 10 0.202 0
# 32 4 20 11 0.874 9
# 33 4 30 11 0.874 19
# 34 4 40 61 0.320 21
# 35 4 50 61 0.320 11
# 36 4 60 61 0.320 1
# 37 4 70 72 0.475 2
# 38 4 80 74 0.472 6
# 39 4 90 92 0.320 2
# 40 4 100 95 0.202 5
# 41 5 10 3 0.320 7
# 42 5 20 25 0.320 5
# 43 5 30 31 0.202 1
# 44 5 40 43 0.472 3
# 45 5 50 51 0.202 1
# 46 5 60 60 0.472 0
# 47 5 70 62 0.475 8
# 48 5 80 91 0.874 11
# 49 5 90 91 0.874 1
# 50 5 100 91 0.874 9
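This is not particularly efficient, but it gives the expected result. The problem appears when the data frames get large. Re-running the code above with a sufficiently large n produces the following error: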
Error: cannot allocate vector of size...
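I think this is because the left join produces an unnecessarily huge intermediate data frame. Clearly, join-then-filter is not the best strategy. But what is the best way to do a fuzzy and a non-fuzzy join at the same time?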
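To make the blow-up concrete, here is a rough sketch of how many rows the intermediate join holds. It assumes every ID occurs in both tables; the names `per_id`, `x`, `y`, and `rows_after_join` are made up for this illustration and are not part of the code above.

```r
# Toy sizes matching the example above: 5 IDs, 10 rows per ID in each table
n <- 5
per_id <- 10
x <- data.frame(ID = rep(1:n, each = per_id))
y <- data.frame(ID = rep(1:n, each = per_id))

# Joining on ID alone pairs every row of x with every row of y that shares
# its ID, so each ID contributes (rows in x) * (rows in y) rows
rows_after_join <- sum(as.numeric(table(x$ID)) * as.numeric(table(y$ID)))
rows_after_join                      # 500

# Cross-check with an actual join in base R
joined <- merge(x, y, by = "ID")
nrow(joined)                         # 500
```

With a fixed number of IDs, the intermediate result grows with the product of the per-ID row counts, even though only one row per (ID, loc.x) pair is kept in the end.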
I think the data.table package is best suited for this job:
library(data.table)
setDT(df)
setDT(df_alt)
df_alt[df
, on = .(ID, loc)
, roll = "nearest"
, .(ID, loc.x = i.loc, loc.y = x.loc, value, delta = abs(i.loc - x.loc))]
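which gives the following (row 27 differs from the dplyr result only because, for ID 3 and loc 70, both 53 and 87 are at distance 17, so the two approaches break the tie differently):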
ID loc.x loc.y value delta
1: 1 10 7 0.8744985 3
2: 1 20 7 0.8744985 13
3: 1 30 36 0.4724253 6
4: 1 40 38 0.2016645 2
5: 1 50 53 0.4750352 3
6: 1 60 53 0.4750352 7
7: 1 70 72 0.4750352 2
8: 1 80 74 0.4724253 6
9: 1 90 92 0.3202490 2
10: 1 100 95 0.2016645 5
11: 2 10 10 0.2016645 0
12: 2 20 25 0.3202490 5
13: 2 30 31 0.2016645 1
14: 2 40 31 0.2016645 9
15: 2 50 52 0.8744985 2
16: 2 60 60 0.4724253 0
17: 2 70 62 0.4750352 8
18: 2 80 87 0.4750352 7
19: 2 90 87 0.4750352 3
20: 2 100 87 0.4750352 13
21: 3 10 7 0.8744985 3
22: 3 20 7 0.8744985 13
23: 3 30 36 0.4724253 6
24: 3 40 38 0.2016645 2
25: 3 50 51 0.2016645 1
26: 3 60 53 0.4750352 7
27: 3 70 53 0.4750352 17
28: 3 80 87 0.3202490 7
29: 3 90 91 0.8744985 1
30: 3 100 91 0.8744985 9
31: 4 10 10 0.2016645 0
32: 4 20 11 0.8744985 9
33: 4 30 11 0.8744985 19
34: 4 40 61 0.3202490 21
35: 4 50 61 0.3202490 11
36: 4 60 61 0.3202490 1
37: 4 70 72 0.4750352 2
38: 4 80 74 0.4724253 6
39: 4 90 92 0.3202490 2
40: 4 100 95 0.2016645 5
41: 5 10 3 0.3202490 7
42: 5 20 25 0.3202490 5
43: 5 30 31 0.2016645 1
44: 5 40 43 0.4724253 3
45: 5 50 51 0.2016645 1
46: 5 60 60 0.4724253 0
47: 5 70 62 0.4750352 8
48: 5 80 91 0.8744985 11
49: 5 90 91 0.8744985 1
50: 5 100 91 0.8744985 9
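The same idea on a toy example may make the mechanics easier to see. In a data.table rolling join, the columns listed in `on` are matched exactly except the last one, which is the column `roll` applies to. The tables `targets` and `candidates` below are invented for illustration and are not the data from the question.

```r
library(data.table)

# Hypothetical toy tables (not the question's data)
targets    <- data.table(ID  = c(1, 1, 2),
                         loc = c(10, 50, 10))
candidates <- data.table(ID    = c(1, 1, 1, 2),
                         loc   = c(12, 41, 70, 90),
                         value = c("a", "b", "c", "d"))

# Exact match on ID, nearest match on loc (the last column in `on`).
# i.loc is the lookup value from targets, x.loc the matched value
# from candidates.
res <- candidates[targets,
                  .(ID, loc.x = i.loc, loc.y = x.loc, value,
                    delta = abs(i.loc - x.loc)),
                  on = .(ID, loc), roll = "nearest"]
res
# Expected matches: (1, 10) -> 12, (1, 50) -> 41, (2, 10) -> 90
```

The result has one row per row of `targets`, so no per-ID cross product is materialised. The example avoids ties on purpose; if two candidates are equally close, check which one your data.table version picks rather than relying on this sketch.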