Left join with groups?

Suppose I have two PySpark DataFrames like these:

minute_df

minute_time | val
----------------
00:00       | 4
00:01       | 5

data_df

minute_time | currency | someOtherVal
------------------------------------
00:00       | USD      | 20
00:01       | USD      | 12
00:00       | CAD      | 14

Note that CAD has no row for 00:01. The final result should look like this:

joined_df

minute_time | currency | val  | someOtherVal
--------------------------------------------
00:00       | USD      |  4   | 20
00:01       | USD      |  5   | 12
00:00       | CAD      |  4   | 14
00:01       | CAD      |  5   | NULL // so in the final result CAD 00:01 should be there

Without the currency this would be straightforward; it is just a left join, such as:

SELECT a.*, b.* FROM minute_df a LEFT JOIN data_df b ON a.minute_time = b.minute_time

The currency case gets tricky when data_df has no record for some minute_time, because in the final result we want every minute_time in minute_df to appear for every currency in data_df.

How can this be achieved?

You can cross join minute_df with the distinct currencies from data_df to get every (minute_time, currency) combination, and then left join data_df onto that, like this:

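from pyspark.sql import SparkSession

# In the pyspark shell `spark` is already defined; getOrCreate() reuses it.
spark = SparkSession.builder.getOrCreate()

spark.createDataFrame(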
    [("00:00", "USD", 20), ("00:01", "USD", 12), ("00:00", "CAD", 14)],
    ["minute_time", "currency", "someOtherVal"]
).createOrReplaceTempView("data_df")

spark.createDataFrame([("00:00", 4), ("00:01", 5)], ["minute_time", "val"]).createOrReplaceTempView("minute_df")

spark.sql("""
WITH minute_currency_df AS (
    SELECT  *
    FROM    minute_df
    CROSS JOIN (SELECT DISTINCT currency FROM data_df)
)

SELECT  m.*, d.someOtherVal 
FROM    minute_currency_df m
LEFT JOIN  data_df d
ON      m.minute_time = d.minute_time
AND     m.currency = d.currency
""").show()

#+-----------+---+--------+------------+
#|minute_time|val|currency|someOtherVal|
#+-----------+---+--------+------------+
#|      00:01|  5|     CAD|        null|
#|      00:01|  5|     USD|          12|
#|      00:00|  4|     USD|          20|
#|      00:00|  4|     CAD|          14|
#+-----------+---+--------+------------+
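
To see why the cross join followed by a left join fills in the missing CAD row, the same logic can be sketched in plain Python, without a Spark session. This is only an illustration of the idea, not how Spark executes the query. The names `minute_rows`, `data_rows`, `lookup` and `joined` are made up for this sketch, and the dictionary lookup assumes at most one row per (minute_time, currency) pair.

```python
from itertools import product

# Same sample rows as minute_df and data_df above.
minute_rows = [("00:00", 4), ("00:01", 5)]
data_rows = [("00:00", "USD", 20), ("00:01", "USD", 12), ("00:00", "CAD", 14)]

# Distinct currencies in first-seen order (the SELECT DISTINCT currency step).
currencies = list(dict.fromkeys(cur for _, cur, _ in data_rows))

# Index data_rows by the join key (minute_time, currency).
# Assumption: each key occurs at most once in data_rows.
lookup = {(minute, cur): other for minute, cur, other in data_rows}

# Cross join minutes with currencies, then left join: a key that is
# missing from data_rows yields None, which plays the role of NULL.
joined = [
    (minute, val, cur, lookup.get((minute, cur)))
    for (minute, val), cur in product(minute_rows, currencies)
]

for row in joined:
    print(row)
```

Every minute is paired with every currency before the lookup, so the ("00:01", "CAD") pair exists even though data_rows has no matching row, and it ends up with None for someOtherVal.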