Many-to-many join (same ID with different dates)

I'm doing an analysis with SQL and R, and I want to join two tables like the ones below:

Table 1:

ID  date
a11 20150302
a11 20150302
a22 20150303
a22 20150304
a33 20150306
a44 20150306
a55 20150307
a66 20150308
a66 20150309
a66 20150310

Table 2

ID  date
a11 20150303
a22 20150304
a22 20150305
a44 20150306
a66 20150308
a66 20150310

The situation is this: a customer receives a call (table 1), and the customer calls back for more information (table 2).

So what I want to do in the analysis is:

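  1. Only show IDs that appear in both tables.
  2. Match each table 2 date to a table 1 date:
    • match the closest date
    • the table 2 date must be >= the table 1 date (e.g. in the result for "a66", 20150310 is assigned to table 1 date 20150310, and 20150308 is assigned to 20150308, not to 20150309)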

Result:

ID  table1 date table2 date
a11 20150302    
a11 20150302    20150303
a22 20150303    20150304
a22 20150304    20150305
a44 20150306    20150306
a66 20150308    20150308
a66 20150309    
a66 20150310    20150310

Is there any solution for this many-to-many matching/join? I don't want n*m rows in the result; I want a 1-to-1 match. A solution in either R or SQL would work.

Thanks

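SELECT ID, Date1, Date2
FROM (
    -- Number the candidate callbacks for each call, earliest first
    SELECT joined.ID, joined.Date1, joined.Date2,
           ROW_NUMBER() OVER (PARTITION BY ID, Date1 ORDER BY Date2 ASC) AS RowNumber
    FROM (
        -- Pair every call with every callback for the same ID;
        -- callbacks earlier than the call become NULL
        SELECT t1.ID, t1.[Date] AS Date1,
               CASE WHEN t2.[Date] >= t1.[Date] THEN t2.[Date] ELSE NULL END AS Date2
        FROM Table1 t1
        LEFT JOIN Table2 t2 ON t1.ID = t2.ID
    ) AS joined
    WHERE joined.Date2 IS NOT NULL  -- drop pairs without a valid callback
) partitioned
WHERE RowNumber = 1  -- keep only the closest callback per call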

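This joins the two tables on ID, keeps only IDs present in both tables, and discards Table 2 dates earlier than the Table 1 date. ROW_NUMBER() OVER (PARTITION BY ID, Date1 ORDER BY Date2 ASC) then ranks the remaining Table 2 dates for each Table 1 row, and the WHERE RowNumber = 1 clause keeps the closest one.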

It produces output consistent with the criteria you listed:

+-----+----------+----------+
| ID  |  Date1   |  Date2   |
+-----+----------+----------+
| a11 | 20150302 | 20150303 |
| a22 | 20150303 | 20150304 |
| a22 | 20150304 | 20150304 |
| a44 | 20150306 | 20150306 |
| a66 | 20150308 | 20150308 |
| a66 | 20150309 | 20150310 |
| a66 | 20150310 | 20150310 |
+-----+----------+----------+
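
For anyone who wants to run this without a SQL Server instance, here is a minimal sketch that replays the same idea against the question's sample data in an in-memory SQLite database from Python. It makes several assumptions that are not in the original: the column names Date1 and Date2, the helper name closest_callbacks, the SQLite dialect (no bracket quoting), and a SQLite build with window-function support (3.25 or later). The LEFT JOIN plus CASE is folded into an inner join with the date condition, which should be equivalent here.

```python
import sqlite3

# Sample data copied from the question.
TABLE1 = [("a11", 20150302), ("a11", 20150302), ("a22", 20150303),
          ("a22", 20150304), ("a33", 20150306), ("a44", 20150306),
          ("a55", 20150307), ("a66", 20150308), ("a66", 20150309),
          ("a66", 20150310)]
TABLE2 = [("a11", 20150303), ("a22", 20150304), ("a22", 20150305),
          ("a44", 20150306), ("a66", 20150308), ("a66", 20150310)]

# Rank the callbacks on or after each call and keep the earliest one.
QUERY = """
SELECT ID, Date1, Date2
FROM (
    SELECT t1.ID AS ID, t1.Date1 AS Date1, t2.Date2 AS Date2,
           ROW_NUMBER() OVER (PARTITION BY t1.ID, t1.Date1
                              ORDER BY t2.Date2) AS RowNumber
    FROM Table1 t1
    JOIN Table2 t2 ON t1.ID = t2.ID AND t2.Date2 >= t1.Date1
) AS ranked
WHERE RowNumber = 1
ORDER BY ID, Date1
"""


def closest_callbacks():
    """Load the sample tables into SQLite and run the ranking query."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Table1 (ID TEXT, Date1 INTEGER)")
    conn.execute("CREATE TABLE Table2 (ID TEXT, Date2 INTEGER)")
    conn.executemany("INSERT INTO Table1 VALUES (?, ?)", TABLE1)
    conn.executemany("INSERT INTO Table2 VALUES (?, ?)", TABLE2)
    rows = conn.execute(QUERY).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for row in closest_callbacks():
        print(row)
```

Because the partition is on (ID, Date1), the duplicated a11 row in Table 1 collapses into a single output row, and a callback can be reused by several calls (20150310 is matched to both a66 calls on 20150309 and 20150310).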

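I got the same result in R with dplyr as markmanguy did. For a22, the closest callback to the initial call on 20150304 is 20150304, not 20150305; you would need a time component to tell them apart.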

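library(dplyr)
inner_join(table1, table2, by = "ID") %>%
  filter(date1 <= date2) %>%              # keep callbacks on or after the call
  group_by(ID, date1) %>%
  arrange(date2, .by_group = TRUE) %>%    # earliest callback first, whatever the input order
  filter(row_number() == 1)               # keep the closest callback per call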

>
Source: local data frame [7 x 3]
Groups: ID, date1 [7]

     ID    date1    date2
  (chr)    (int)    (int)
1   a11 20150302 20150303
2   a22 20150303 20150304
3   a22 20150304 20150304
4   a44 20150306 20150306
5   a66 20150308 20150308
6   a66 20150309 20150310
7   a66 20150310 20150310

Data

table1 <-read.table(text="ID  date1
a11 20150302
a11 20150302
a22 20150303
a22 20150304
a33 20150306
a44 20150306
a55 20150307
a66 20150308
a66 20150309
a66 20150310", header=T,stringsAsFactors =F)
table2 <-read.table(text="ID  date2
a11 20150303
a22 20150304
a22 20150305
a44 20150306
a66 20150308
a66 20150310", header=T,stringsAsFactors =F)

This doesn't solve it, but it's close and may give you an idea.

SqlFiddleDemo

With t_left as (
    SELECT *, row_number() over (partition by "ID" order by date desc ) as rn
    FROM Table1 T
    WHERE EXISTS (SELECT 1 FROM Table2 P WHERE T."ID" = P."ID")
),
t_right as (
    SELECT *, row_number() over (partition by "ID" order by date desc) as rn
    FROM Table2
) 
SELECT t_left."ID", t_left."date", t_right."date"
FROM t_left
LEFT JOIN t_right
       on t_left.rn = t_right.rn
      and t_left."ID" = t_right."ID"
ORDER BY t_left."ID", t_left."date"

Output

|  ID |     date |     date |
|-----|----------|----------|
| a11 | 20150302 | 20150303 |
| a11 | 20150302 |   (null) |
| a22 | 20150303 | 20150304 |
| a22 | 20150304 | 20150305 |
| a44 | 20150306 | 20150306 |
| a66 | 20150308 |   (null) |
| a66 | 20150309 | 20150308 |
| a66 | 20150310 | 20150310 |
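
One possible way to get from here to a strict 1-to-1 match is a greedy pass per ID in ordinary code rather than SQL. The sketch below is only an illustration under stated assumptions: the function name match_one_to_one and the rule "give each callback to the latest still-unmatched call on or before it" are mine, not something taken from the question or the answers. On the sample data it reproduces the requested rows for a11, a44 and a66, but for a22 it pairs 20150304 with 20150304 (and 20150303 with 20150305), because with dates alone the closest call to that callback is the same-day one.

```python
from collections import defaultdict

# Sample data copied from the question.
TABLE1 = [("a11", 20150302), ("a11", 20150302), ("a22", 20150303),
          ("a22", 20150304), ("a33", 20150306), ("a44", 20150306),
          ("a55", 20150307), ("a66", 20150308), ("a66", 20150309),
          ("a66", 20150310)]
TABLE2 = [("a11", 20150303), ("a22", 20150304), ("a22", 20150305),
          ("a44", 20150306), ("a66", 20150308), ("a66", 20150310)]


def match_one_to_one(calls, callbacks):
    """Greedy 1-to-1 match: every callback is assigned to at most one call."""
    calls_by_id = defaultdict(list)
    for cid, date in calls:
        calls_by_id[cid].append(date)
    callbacks_by_id = defaultdict(list)
    for cid, date in callbacks:
        callbacks_by_id[cid].append(date)

    result = []
    # Only IDs present in both tables are reported.
    for cid in sorted(calls_by_id.keys() & callbacks_by_id.keys()):
        call_dates = sorted(calls_by_id[cid])
        matched = [None] * len(call_dates)
        for cb in sorted(callbacks_by_id[cid]):
            # Walk the calls from latest to earliest and take the first
            # unmatched one that is on or before this callback.
            for i in range(len(call_dates) - 1, -1, -1):
                if matched[i] is None and call_dates[i] <= cb:
                    matched[i] = cb
                    break
        result.extend((cid, d, m) for d, m in zip(call_dates, matched))
    return result


if __name__ == "__main__":
    for row in match_one_to_one(TABLE1, TABLE2):
        print(row)
```

Calls with no callback left over come back with None in the third position, which corresponds to the blank cells in the requested result.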