Many-to-many join (same ID with different dates)
I am doing an analysis using SQL and R, and I want to join two tables like the ones below:
Table 1:
ID date
a11 20150302
a11 20150302
a22 20150303
a22 20150304
a33 20150306
a44 20150306
a55 20150307
a66 20150308
a66 20150309
a66 20150310
Table 2
ID date
a11 20150303
a22 20150304
a22 20150305
a44 20150306
a66 20150308
a66 20150310
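The situation is this: customers receive a call (table 1), and some customers call back for more information (table 2).
What I want to do in the analysis is:
- Only show IDs that are in table 2
- Match table 2 dates to table 1 dates:
  - match the closest date
  - the table 2 date must be >= the table 1 date
(e.g. for "a66", 20150310 is assigned to table 1 date 20150310, and 20150308 is assigned to 20150308, not 20150309)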
Result:
ID   table1 date  table2 date
a11  20150302
a11  20150302     20150303
a22  20150303     20150304
a22  20150304     20150305
a44  20150306     20150306
a66  20150308     20150308
a66  20150309
a66  20150310     20150310
Is there any solution for this many-to-many matching/join? I don't want n*m rows as the result, I want 1-to-1. I need a solution in R or SQL.
Thanks
SELECT ID, Date1, Date2 FROM (
SELECT joined.ID, joined.Date1, joined.Date2, ROW_NUMBER() OVER (PARTITION BY ID, Date1 ORDER BY Date2 ASC) AS RowNumber
FROM(
SELECT t1.ID, t1.[Date] as Date1, CASE WHEN t2.[Date] >= t1.[Date] THEN t2.[Date] ELSE NULL END as [Date2]
FROM Table1 t1
LEFT JOIN Table2 t2 ON t1.ID = t2.ID) as joined
WHERE joined.Date2 IS NOT NULL
) partitioned
WHERE RowNumber = 1
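Join the two tables on ID and drop the Table 1 rows that have no Table 2 date on or after them. Then use ROW_NUMBER() OVER (PARTITION BY ID, Date1 ORDER BY Date2 ASC) to rank the candidate dates for each call; the WHERE RowNumber = 1 clause keeps the closest one.
This produces output that satisfies the closest-date and >= conditions you listed (note that the same Table 2 date can be matched to more than one Table 1 row):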
+-----+----------+----------+
| ID | Date1 | Date2 |
+-----+----------+----------+
| a11 | 20150302 | 20150303 |
| a22 | 20150303 | 20150304 |
| a22 | 20150304 | 20150304 |
| a44 | 20150306 | 20150306 |
| a66 | 20150308 | 20150308 |
| a66 | 20150309 | 20150310 |
| a66 | 20150310 | 20150310 |
+-----+----------+----------+
I get the same result as markmanguy in R with dplyr. For a22, the callback closest to the 20150304 initial call is 20150304, not 20150305; a time component would be needed to tell them apart.
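library(dplyr)
inner_join(table1, table2, by = "ID") %>%
  filter(date1 <= date2) %>%               # keep callbacks on or after the call
  group_by(ID, date1) %>%
  arrange(date2, .by_group = TRUE) %>%     # earliest qualifying callback first
  filter(row_number() == 1)                # one row per (ID, date1)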
>
Source: local data frame [7 x 3]
Groups: ID, date1 [7]
ID date1 date2
(chr) (int) (int)
1 a11 20150302 20150303
2 a22 20150303 20150304
3 a22 20150304 20150304
4 a44 20150306 20150306
5 a66 20150308 20150308
6 a66 20150309 20150310
7 a66 20150310 20150310
Data
table1 <-read.table(text="ID date1
a11 20150302
a11 20150302
a22 20150303
a22 20150304
a33 20150306
a44 20150306
a55 20150307
a66 20150308
a66 20150309
a66 20150310", header=T,stringsAsFactors =F)
table2 <-read.table(text="ID date2
a11 20150303
a22 20150304
a22 20150305
a44 20150306
a66 20150308
a66 20150310", header=T,stringsAsFactors =F)
This doesn't fully solve it, but it is close and may give you an idea:
With t_left as (
SELECT *, row_number() over (partition by "ID" order by date desc ) as rn
FROM Table1 T
WHERE EXISTS (SELECT 1 FROM Table2 P WHERE T."ID" = P."ID")
),
t_right as (
SELECT *, row_number() over (partition by "ID" order by date desc) as rn
FROM Table2
)
SELECT t_left."ID", t_left."date", t_right."date"
FROM t_left
LEFT JOIN t_right
on t_left.rn = t_right.rn
and t_left."ID" = t_right."ID"
ORDER BY t_left."ID", t_left."date"
Output
| ID | date | date |
|-----|----------|----------|
| a11 | 20150302 | 20150303 |
| a11 | 20150302 | (null) |
| a22 | 20150303 | 20150304 |
| a22 | 20150304 | 20150305 |
| a44 | 20150306 | 20150306 |
| a66 | 20150308 | (null) |
| a66 | 20150309 | 20150308 |
| a66 | 20150310 | 20150310 |
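Building on the descending-rank idea above, here is a minimal base R sketch of one greedy rule that also enforces the >= condition: within each ID, take the table 2 dates from latest to earliest and give each one to the latest still-unmatched table 1 date that is on or before it. This rule is an assumption on my part, not something stated in the question, and `match_one_to_one` is a hypothetical helper name. It also assumes the dates are plain yyyymmdd numbers, as in the sample data. On the sample data it reproduces the eight rows of the expected result, but it is not guaranteed to give the matching you want on other data.

```r
# Sample data copied from the question.
table1 <- data.frame(
  ID    = c("a11", "a11", "a22", "a22", "a33", "a44", "a55", "a66", "a66", "a66"),
  date1 = c(20150302, 20150302, 20150303, 20150304, 20150306,
            20150306, 20150307, 20150308, 20150309, 20150310),
  stringsAsFactors = FALSE
)
table2 <- data.frame(
  ID    = c("a11", "a22", "a22", "a44", "a66", "a66"),
  date2 = c(20150303, 20150304, 20150305, 20150306, 20150308, 20150310),
  stringsAsFactors = FALSE
)

# Hypothetical helper: greedy 1-to-1 matching per ID (assumed rule, see above).
match_one_to_one <- function(t1, t2) {
  # Keep only IDs that also occur in table 2, sorted by ID and call date.
  out <- t1[t1$ID %in% t2$ID, , drop = FALSE]
  out <- out[order(out$ID, out$date1), , drop = FALSE]
  rownames(out) <- NULL
  out$date2 <- rep(NA_real_, nrow(out))  # NA means "no callback assigned"

  for (id in unique(out$ID)) {
    rows <- which(out$ID == id)  # row indices, ascending by date1
    # Process callbacks from latest to earliest.
    callbacks <- sort(t2$date2[t2$ID == id], decreasing = TRUE)
    for (d2 in callbacks) {
      # Unmatched calls dated on or before this callback.
      free <- rows[is.na(out$date2[rows]) & out$date1[rows] <= d2]
      if (length(free) > 0) {
        # Rows are sorted, so the largest index is the latest call.
        out$date2[max(free)] <- d2
      }
    }
  }
  out
}

result <- match_one_to_one(table1, table2)
print(result)
```

Each table 2 date is used at most once, so calls without a callback left over (one of the two a11 calls, and a66 on 20150309) keep NA in date2. If two calls on the same day need to be told apart, a time component would still be required.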