Join the table with incremental data of the same table
I'm trying to implement some logic in Redshift Spectrum. My original table looks like this.
Records in the student table:
1 || student1 || Boston || 2019-01-01
2 || student2 || New York || 2019-02-01
3 || student3 || Chicago || 2019-03-01
1 || student1 || Dallas || 2019-03-01
Records in the incremental table studentinc look like this:
1 || student1 || SFO || 2019-04-01
4 || student4 || Detroit || 2019-04-01
By joining the student and studentinc tables, I'm trying to get the latest set of records, like this:
2 || student2 || New York || 2019-02-01
3 || student3 || Chicago || 2019-03-01
1 || student1 || SFO || 2019-04-01
4 || student4 || Detroit || 2019-04-01
I arrived at this solution by doing a UNION of student and studentinc, then querying the result of the union based on max(modified_ts). However, this solution doesn't work well for huge tables. Is there a better solution that works by joining the two tables?
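For reference, the union + max(modified_ts) approach described above can be sketched in Spark SQL like this (a sketch only; the column names id, name, city, modified_ts are assumed from the sample rows, and ties on modified_ts for the same id would produce duplicates):

```scala
// Hypothetical sketch of the union + max(modified_ts) approach.
// Assumes tables student and studentinc are registered, with columns
// (id, name, city, modified_ts) matching the sample data above.
val latest = spark.sql("""
  SELECT u.id, u.name, u.city, u.modified_ts
  FROM (SELECT * FROM student
        UNION ALL
        SELECT * FROM studentinc) u
  JOIN (SELECT id, MAX(modified_ts) AS max_ts
        FROM (SELECT * FROM student
              UNION ALL
              SELECT * FROM studentinc)
        GROUP BY id) m
    ON u.id = m.id AND u.modified_ts = m.max_ts
""")
```

The aggregation over the full union is what makes this expensive on huge tables, which is what the answers below try to avoid.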
1. Using Spark SQL, you can achieve this with not in and union:
scala> var df1 = Seq((1 ,"student1","Boston " , "2019-01-01" ),(2 ,"student2","New York" , "2019-02-01"),(3 ,"student3","Chicago " , "2019-03-01" ),(1 ,"student1","Dallas " , "2019-03-01")).toDF("id","name","country","_date")
Register it as a temporary table:
scala> df1.registerTempTable("temp1")
scala> sql("select * from temp1") .show
+---+--------+--------+----------+
| id| name| country| _date|
+---+--------+--------+----------+
| 1|student1|Boston |2019-01-01|
| 2|student2|New York|2019-02-01|
| 3|student3|Chicago |2019-03-01|
| 1|student1|Dallas |2019-03-01|
+---+--------+--------+----------+
The second DataFrame:
scala> var df3 = Seq((1 , "student1", "SFO", "2019-04-01"),(4 , "student4", "Detroit", "2019-04-01")).toDF("id","name","country","_date")
scala> df3.show
+---+--------+-------+----------+
| id| name|country| _date|
+---+--------+-------+----------+
| 1|student1| SFO|2019-04-01|
| 4|student4|Detroit|2019-04-01|
+---+--------+-------+----------+
Run the not in query together with union (note that df3 must first be registered as temp2):
scala> df3.registerTempTable("temp2")
scala> sql("select * from (select * from temp1 where id not in (select id from temp2)) tt").union(df3).show
+---+--------+--------+----------+
| id| name| country| _date|
+---+--------+--------+----------+
| 2|student2|New York|2019-02-01|
| 3|student3|Chicago |2019-03-01|
| 1|student1| SFO|2019-04-01|
| 4|student4| Detroit|2019-04-01|
+---+--------+--------+----------+
2. Using the Spark DataFrame API. This is faster than the IN query, because IN performs a row-by-row operation:
scala> df1.join(df3,Seq("id"),"left_anti").union (df3).show
+---+--------+--------+----------+
| id| name| country| _date|
+---+--------+--------+----------+
| 2|student2|New York|2019-02-01|
| 3|student3|Chicago |2019-03-01|
| 1|student1| SFO|2019-04-01|
| 4|student4| Detroit|2019-04-01|
+---+--------+--------+----------+
Hope it helps you. Let me know if you have any questions related to this.
I would recommend window functions:
select s.*
from (select s.*,
             row_number() over (partition by studentid order by date desc) as seqnum
      from ((select s.* from student s
            ) union all
            (select i.* from incremental i
            )
           ) s
     ) s
where seqnum = 1;
Note: union all requires that the columns be exactly the same and in the same order. If they are not, you may need to list the columns explicitly.
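The same row_number() deduplication can be written with the Spark DataFrame API instead of SQL. A minimal sketch, reusing the df1 and df3 DataFrames from the first answer (which share the columns id, name, country, _date):

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

// Rank rows per id by date, newest first, then keep only the newest row.
val w = Window.partitionBy("id").orderBy(col("_date").desc)

val latest = df1.union(df3)
  .withColumn("seqnum", row_number().over(w))
  .where(col("seqnum") === 1)
  .drop("seqnum")
```

Unlike the left_anti variant, this also handles the case where the incremental data itself contains more than one row per id.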