Use sqldf to join exactly on id and on the most recent date in a lagged window
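I would like to join two data sets, A and B. I would like to join A and B exactly on their id variable, but keep only the most recent observation in B that is between three months and three years old.
The data sets are large enough that I need to use the sqldf package (about 500,000 rows in A and about 250,000 rows in B). It seems the logic should be to LEFT OUTER JOIN A and B on A.id = B.id and (A.date - B.date) BETWEEN 3*30 AND 3*365, then GROUP BY A.row, ORDER BY B.date DESC, and keep the first observation. However, my code below keeps the first observation overall, not the first observation per A.row group.
I can do this join in two steps (one in sqldf, one in the tidyverse), but can sqldf do both steps?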
library(tidyverse)
library(lubridate)
#>
#> Attaching package: 'lubridate'
#> The following object is masked from 'package:base':
#>
#> date
library(sqldf)
#> Loading required package: gsubfn
#> Loading required package: proto
#> Loading required package: RSQLite
# Some toy data:
A <- tibble(id = rep(1:10, each = 2),
subid = rep(1:2, 10),
date = rep(ymd('2019-01-01'), 20))
A$row <- seq(nrow(A))
set.seed(42)
B <- tibble(id = rep(1:10, each = 10),
date = ymd('2015-01-01') + months(10*rep(1:10, 10)),
x = runif(100))
# This code properly matches A and B, but only returns the first observation OVERALL, not per A.row:
C <- sqldf('SELECT *
FROM A
LEFT OUTER JOIN B
ON A.id = B.id
AND (A.date - B.date) BETWEEN 3*30 and 3*365
GROUP BY row
ORDER BY B.date DESC
LIMIT 1') %>%
as_tibble()
C
#> # A tibble: 1 x 7
#> id subid date row id..5 date..6 x
#> <int> <int> <date> <int> <int> <dbl> <dbl>
#> 1 1 1 2019-01-01 1 1 17652 0.830
# I could do this in two steps, with the first step in sqldf and the second step in the tidyverse.
# This two-step approach would work for my data, because B has annual data, so there should not be
# more than three matches per row in A. However, it seems like I should be able to do the entire join
# in sqldf (and maybe one day I will not be able to do the second step in the tidyverse).
D <- sqldf('SELECT *
FROM A
LEFT OUTER JOIN B
ON A.id = B.id
AND (A.date - B.date) BETWEEN 3*30 and 3*365') %>%
as_tibble()
E <- D %>%
arrange(row, desc(date..6)) %>%
group_by(row) %>%
filter(row_number() == 1) %>%
ungroup()
# Below is the desired output. Can sqldf do both steps?
E
#> # A tibble: 20 x 7
#> id subid date row id..5 date..6 x
#> <int> <int> <date> <int> <int> <dbl> <dbl>
#> 1 1 1 2019-01-01 1 1 17652 0.830
#> 2 1 2 2019-01-01 2 1 17652 0.830
#> 3 2 1 2019-01-01 3 2 17652 0.255
#> 4 2 2 2019-01-01 4 2 17652 0.255
#> 5 3 1 2019-01-01 5 3 17652 0.947
#> 6 3 2 2019-01-01 6 3 17652 0.947
#> 7 4 1 2019-01-01 7 4 17652 0.685
#> 8 4 2 2019-01-01 8 4 17652 0.685
#> 9 5 1 2019-01-01 9 5 17652 0.974
#> 10 5 2 2019-01-01 10 5 17652 0.974
#> 11 6 1 2019-01-01 11 6 17652 0.785
#> 12 6 2 2019-01-01 12 6 17652 0.785
#> 13 7 1 2019-01-01 13 7 17652 0.566
#> 14 7 2 2019-01-01 14 7 17652 0.566
#> 15 8 1 2019-01-01 15 8 17652 0.479
#> 16 8 2 2019-01-01 16 8 17652 0.479
#> 17 9 1 2019-01-01 17 9 17652 0.646
#> 18 9 2 2019-01-01 18 9 17652 0.646
#> 19 10 1 2019-01-01 19 10 17652 0.933
#> 20 10 2 2019-01-01 20 10 17652 0.933
Created on 2019-07-12 by the reprex package (v0.3.0)
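Consider a window function such as RANK(), which dplyr::row_number() likely adapts (along with other SQL semantics such as select, group_by, and case_when). SQLite (the default dialect of sqldf) added support for window functions in version 3.25.0 (released in September 2018).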
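Whether window functions are available depends on the SQLite version that your installation actually runs. A minimal sketch for checking this, assuming sqldf is using its default SQLite backend (the variable names v and has_window_functions are only illustrative):

```r
library(sqldf)

# Ask the SQLite engine behind sqldf for its version string, e.g. "3.39.4".
v <- sqldf("SELECT sqlite_version() AS version")$version

# Window functions were added in SQLite 3.25.0.
has_window_functions <- package_version(v) >= "3.25.0"
has_window_functions
```

If this returns FALSE, a window-function query will fail with a syntax error on the SQLite backend.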
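If window functions are not available in your sqldf (depending on the version), use a Postgres backend via RPostgreSQL. See the author's docs. Possibly too soon or very soon, RMySQL will be another supported backend, since MySQL 8 recently added support for window functions.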
library(RPostgreSQL)
library(sqldf)
D <- sqldf('WITH cte AS
(SELECT *,
-- Partition by A.row (B has no "row" column) so that rows are ranked within each row of A.
RANK() OVER (PARTITION BY "A".row ORDER BY "B".date DESC) AS rn
FROM "A"
LEFT JOIN "B"
ON "A".id = "B".id
AND ("A".date - "B".date) BETWEEN 3*30 and 3*365
)
SELECT * FROM cte
WHERE rn = 1')
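The same idea should also run on sqldf's default SQLite backend, provided the bundled SQLite is at least 3.25.0. The following is a sketch under that assumption, not a tested production query. It swaps RANK() for ROW_NUMBER() so that ties on B.date still yield exactly one row per A.row, lists the columns explicitly to avoid duplicate names in the CTE, and rebuilds the question's toy data in base R. The CTE name ranked and the output column names are illustrative.

```r
library(sqldf)

# Toy data with the same shape as in the question, built with base R only.
A <- data.frame(id = rep(1:10, each = 2),
                subid = rep(1:2, 10),
                date = as.Date("2019-01-01"))
A$row <- seq_len(nrow(A))
set.seed(42)
B <- data.frame(id = rep(1:10, each = 10),
                date = rep(seq(as.Date("2015-11-01"), by = "10 months",
                               length.out = 10), 10),
                x = runif(100))

# Number the matching B rows within each A.row, newest first, and keep rn = 1.
# "row" is quoted because ROW is also an SQL keyword.
res <- sqldf('
  WITH ranked AS (
    SELECT A."row" AS A_row, A.id, A.subid,
           A.date AS A_date__Date,
           B.date AS B_date__Date,
           B.x,
           ROW_NUMBER() OVER (PARTITION BY A."row"
                              ORDER BY B.date DESC) AS rn
    FROM A
    LEFT JOIN B
      ON A.id = B.id
     AND (A.date - B.date) BETWEEN 3*30 AND 3*365
  )
  SELECT A_row, id, subid, A_date__Date, B_date__Date, x
  FROM ranked
  WHERE rn = 1
  ORDER BY A_row', method = "name__class")
res
```

Rows of A without any match in B are kept by the LEFT JOIN and get rn = 1 with NULL values in the B columns.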
In SQLite, if you use max or min with group by, then the other (bare) columns are taken from the entire row that holds the maximum or minimum, so:
sqldf('SELECT
A.rowid as A_row,
A.id,
A.subid,
A.date as A_date__Date,
max(B.rowid) as B_row,
B.date as B_date__Date,
B.x
FROM A
LEFT OUTER JOIN B ON A.id = B.id AND (A.date - B.date) BETWEEN 3*30 AND 3*365
GROUP BY A.rowid
', method = "name__class")
giving:
A_row id subid A_date B_row B_date x
1 1 1 1 2019-01-01 4 2018-05-01 0.8304476
2 2 1 2 2019-01-01 4 2018-05-01 0.8304476
3 3 2 1 2019-01-01 14 2018-05-01 0.2554288
4 4 2 2 2019-01-01 14 2018-05-01 0.2554288
5 5 3 1 2019-01-01 24 2018-05-01 0.9466682
6 6 3 2 2019-01-01 24 2018-05-01 0.9466682
7 7 4 1 2019-01-01 34 2018-05-01 0.6851697
8 8 4 2 2019-01-01 34 2018-05-01 0.6851697
9 9 5 1 2019-01-01 44 2018-05-01 0.9735399
10 10 5 2 2019-01-01 44 2018-05-01 0.9735399
11 11 6 1 2019-01-01 54 2018-05-01 0.7846928
12 12 6 2 2019-01-01 54 2018-05-01 0.7846928
13 13 7 1 2019-01-01 64 2018-05-01 0.5664884
14 14 7 2 2019-01-01 64 2018-05-01 0.5664884
15 15 8 1 2019-01-01 74 2018-05-01 0.4793986
16 16 8 2 2019-01-01 74 2018-05-01 0.4793986
17 17 9 1 2019-01-01 84 2018-05-01 0.6456319
18 18 9 2 2019-01-01 84 2018-05-01 0.6456319
19 19 10 1 2019-01-01 94 2018-05-01 0.9330341
20 20 10 2 2019-01-01 94 2018-05-01 0.9330341
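Note that max(B.rowid) picks the most recent observation only when B is stored in ascending date order within each id, as it is in the toy data. A hedged variant that takes max(B.date) instead does not depend on row order. The small data frames below are made up for illustration and are deliberately not sorted by date:

```r
library(sqldf)

A <- data.frame(id = c(1L, 1L, 2L),
                subid = c(1L, 2L, 1L),
                date = as.Date("2019-01-01"))
# B is NOT sorted by date within id; 2019-03-01 lies outside the window.
B <- data.frame(id = c(1L, 1L, 1L, 2L, 2L),
                date = as.Date(c("2018-05-01", "2017-07-01", "2019-03-01",
                                 "2016-09-01", "2018-05-01")),
                x = c(0.1, 0.2, 0.3, 0.4, 0.5))

# With max(B.date), SQLite takes the bare column B.x from the row
# that holds the maximum date within each group.
res <- sqldf('SELECT A.rowid AS A_row,
                     A.id,
                     A.subid,
                     A.date AS A_date__Date,
                     max(B.date) AS B_date__Date,
                     B.x
              FROM A
              LEFT OUTER JOIN B
                ON A.id = B.id
               AND (A.date - B.date) BETWEEN 3*30 AND 3*365
              GROUP BY A.rowid
              ORDER BY A.rowid', method = "name__class")
res
```

Here max(B.rowid) would have returned the 2017-07-01 row for id 1, because that row comes later in B than the 2018-05-01 row. This bare-column behaviour is specific to SQLite; other databases generally reject or handle such queries differently.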