PySpark: merging multiple dataframes (outer join) and keeping only a single occurrence of the primary key (joined on the basis of two columns / keys)
I have two dataframes
df2
+----------+-------------------+-------------------+-------------+
|Event_Type| start| end|agg_sum_10_15|
+----------+-------------------+-------------------+-------------+
| event1|2016-04-25 05:30:00|2016-05-02 05:30:00| 1.0|
| event1|2016-05-09 05:30:00|2016-05-16 05:30:00| 3.0|
| event1|2016-06-06 05:30:00|2016-06-13 05:30:00| 3.0|
| event2|2016-05-09 05:30:00|2016-05-16 05:30:00| 1.0|
| event2|2016-06-06 05:30:00|2016-06-13 05:30:00| 1.0|
| event3|2016-05-16 05:30:00|2016-05-23 05:30:00| 1.0|
| event3|2016-06-13 05:30:00|2016-06-20 05:30:00| 1.0|
+----------+-------------------+-------------------+-------------+
and df3
+----------+-------------------+-------------------+--------------+
|Event_Type| start| end|agg_sum_15_110|
+----------+-------------------+-------------------+--------------+
| event1|2016-04-25 05:30:00|2016-05-02 05:30:00| 1.0|
| event1|2016-05-30 05:30:00|2016-06-06 05:30:00| 1.0|
| event2|2016-05-02 05:30:00|2016-05-09 05:30:00| 2.0|
| event2|2016-05-16 05:30:00|2016-05-23 05:30:00| 2.0|
| event3|2016-05-02 05:30:00|2016-05-09 05:30:00| 11.0|
| event3|2016-05-23 05:30:00|2016-05-30 05:30:00| 1.0|
+----------+-------------------+-------------------+--------------+
There could be multiple such dataframes. The key columns to match on are 'Event_Type' and 'start', and when I join them (outer join) the columns get duplicated. Is there a way to keep each column only once, with nulls filled in wherever there is no match?
The purpose of the outer join: whenever there is a match (based on the keys) there should be a single row, and when there is no match an extra row is added, with nulls for the missing values.
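For reference, a minimal sketch of how the two dataframes above could be built (it assumes an active SparkSession bound to a variable named spark, and keeps the timestamps as plain strings for brevity; the values are copied from the tables shown above):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # assumed session, not part of the original post

df2 = spark.createDataFrame(
    [("event1", "2016-04-25 05:30:00", "2016-05-02 05:30:00", 1.0),
     ("event1", "2016-05-09 05:30:00", "2016-05-16 05:30:00", 3.0),
     ("event1", "2016-06-06 05:30:00", "2016-06-13 05:30:00", 3.0),
     ("event2", "2016-05-09 05:30:00", "2016-05-16 05:30:00", 1.0),
     ("event2", "2016-06-06 05:30:00", "2016-06-13 05:30:00", 1.0),
     ("event3", "2016-05-16 05:30:00", "2016-05-23 05:30:00", 1.0),
     ("event3", "2016-06-13 05:30:00", "2016-06-20 05:30:00", 1.0)],
    ["Event_Type", "start", "end", "agg_sum_10_15"])

df3 = spark.createDataFrame(
    [("event1", "2016-04-25 05:30:00", "2016-05-02 05:30:00", 1.0),
     ("event1", "2016-05-30 05:30:00", "2016-06-06 05:30:00", 1.0),
     ("event2", "2016-05-02 05:30:00", "2016-05-09 05:30:00", 2.0),
     ("event2", "2016-05-16 05:30:00", "2016-05-23 05:30:00", 2.0),
     ("event3", "2016-05-02 05:30:00", "2016-05-09 05:30:00", 11.0),
     ("event3", "2016-05-23 05:30:00", "2016-05-30 05:30:00", 1.0)],
    ["Event_Type", "start", "end", "agg_sum_15_110"])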
Joined using the following code:
dftotal = df2.join(df3, (df2.Event_Type == df3.Event_Type) & (df2.start == df3.start), 'outer')
The above code gives the following output:
+----------+-------------------+-------------------+-------------+----------+-------------------+-------------------+--------------+
|Event_Type| start| end|agg_sum_10_15|Event_Type| start| end|agg_sum_15_110|
+----------+-------------------+-------------------+-------------+----------+-------------------+-------------------+--------------+
| null| null| null| null| event3|2016-05-23 05:30:00|2016-05-30 05:30:00| 1.0|
| event2|2016-05-09 05:30:00|2016-05-16 05:30:00| 1.0| null| null| null| null|
| event1|2016-05-09 05:30:00|2016-05-16 05:30:00| 3.0| null| null| null| null|
| event3|2016-05-16 05:30:00|2016-05-23 05:30:00| 1.0| null| null| null| null|
| null| null| null| null| event1|2016-05-30 05:30:00|2016-06-06 05:30:00| 1.0|
| null| null| null| null| event2|2016-05-02 05:30:00|2016-05-09 05:30:00| 2.0|
| null| null| null| null| event3|2016-05-02 05:30:00|2016-05-09 05:30:00| 11.0|
| event2|2016-06-06 05:30:00|2016-06-13 05:30:00| 1.0| null| null| null| null|
| event3|2016-06-13 05:30:00|2016-06-20 05:30:00| 1.0| null| null| null| null|
| null| null| null| null| event2|2016-05-16 05:30:00|2016-05-23 05:30:00| 2.0|
| event1|2016-06-06 05:30:00|2016-06-13 05:30:00| 3.0| null| null| null| null|
| event1|2016-04-25 05:30:00|2016-05-02 05:30:00| 1.0| event1|2016-04-25 05:30:00|2016-05-02 05:30:00| 1.0|
+----------+-------------------+-------------------+-------------+----------+-------------------+-------------------+--------------+
I want a single 'Event_Type' column, with the nulls in the first 'Event_Type' filled from the second 'Event_Type' column, and the same for the 'start' field. Hopefully that explains the desired output.
I read somewhere that the 'coalesce' function might help.
You are right: coalesce is exactly what you are looking for.
>>> from pyspark.sql.functions import *
>>> dftotal = df2.join(df3, (df2.Event_Type == df3.Event_Type) & (df2.start == df3.start), 'outer') \
...     .select(coalesce(df2.Event_Type, df3.Event_Type), coalesce(df2.start, df3.start),
...             df2.end, df2.agg_sum_10_15, df3.end, df3.agg_sum_15_110)
>>> dftotal.show()
+--------------------------------+----------------------+-------------------+-------------+-------------------+--------------+
|coalesce(Event_Type, Event_Type)|coalesce(start, start)| end|agg_sum_10_15| end|agg_sum_15_110|
+--------------------------------+----------------------+-------------------+-------------+-------------------+--------------+
| event1| 2016-05-09 05:30:00|2016-05-16 05:30:00| 3.0| null| null|
| event1| 2016-06-06 05:30:00|2016-06-13 05:30:00| 3.0| null| null|
| event2| 2016-05-02 05:30:00| null| null|2016-05-09 05:30:00| 2.0|
| event3| 2016-05-02 05:30:00| null| null|2016-05-09 05:30:00| 11.0|
| event2| 2016-05-16 05:30:00| null| null|2016-05-23 05:30:00| 2.0|
| event1| 2016-05-30 05:30:00| null| null|2016-06-06 05:30:00| 1.0|
| event3| 2016-05-16 05:30:00|2016-05-23 05:30:00| 1.0| null| null|
| event2| 2016-06-06 05:30:00|2016-06-13 05:30:00| 1.0| null| null|
| event1| 2016-04-25 05:30:00|2016-05-02 05:30:00| 1.0|2016-05-02 05:30:00| 1.0|
| event3| 2016-06-13 05:30:00|2016-06-20 05:30:00| 1.0| null| null|
| event3| 2016-05-23 05:30:00| null| null|2016-05-30 05:30:00| 1.0|
| event2| 2016-05-09 05:30:00|2016-05-16 05:30:00| 1.0| null| null|
+--------------------------------+----------------------+-------------------+-------------+-------------------+--------------+
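A side note, not part of the original answer: appending .alias('Event_Type') and .alias('start') to the two coalesce(...) calls gives the merged columns their plain names instead of coalesce(Event_Type, Event_Type). And since every dataframe here uses the same names for the key columns, another sketch is to join on the column names themselves, which makes Spark keep a single copy of each key automatically and, combined with functools.reduce, extends to any number of dataframes (dfs and the reassigned dftotal below are illustrative names):
>>> from functools import reduce
>>> dfs = [df2, df3]  # could be any number of dataframes sharing 'Event_Type' and 'start'
>>> dftotal = reduce(
...     lambda left, right: left.join(right, on=["Event_Type", "start"], how="outer"),
...     dfs)
Note that non-key columns sharing a name across dataframes (like end here) would still appear twice in the result and would need renaming or coalescing before they can be referenced unambiguously.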