Left join with groups?
Suppose I have two PySpark DataFrames like this:
minute_df
minute_time | val
----------------
00:00 | 4
00:01 | 5
data_df
minute_time | currency | someOtherVal
------------------------------------
00:00 | USD | 20
00:01 | USD | 12
00:00 | CAD | 14
Notice that CAD has no row for 00:01. The final result should look like this:
joined_df
minute_time | currency | val | someOtherVal
-------------------------------------------
00:00 | USD | 4 | 20
00:01 | USD | 5 | 12
00:00 | CAD | 4 | 14
00:01 | CAD | 5 | NULL // so in the final result CAD 00:01 should be there
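Without the currency column this would be trivial; it is just a left join, like:
SELECT a.*, b.* FROM minute_df a LEFT JOIN data_df b ON a.minute_time = b.minute_time
The currency case gets tricky when data_df has no record for some minute_time, because in the final result we want every minute_time in minute_df to appear for every currency in data_df.
How can this be achieved?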
You can cross join minute_df with the distinct currencies from data_df to get every (minute_time, currency) combination, and then do the left join, like this:
spark.createDataFrame(
[("00:00", "USD", 20), ("00:01", "USD", 12), ("00:00", "CAD", 14)],
["minute_time", "currency", "someOtherVal"]
).createOrReplaceTempView("data_df")
spark.createDataFrame([("00:00", 4), ("00:01", 5)], ["minute_time", "val"]).createOrReplaceTempView("minute_df")
spark.sql("""
WITH minute_currency_df AS (
SELECT *
FROM minute_df
CROSS JOIN (SELECT DISTINCT currency FROM data_df)
)
SELECT m.*, d.someOtherVal
FROM minute_currency_df m
LEFT JOIN data_df d
ON m.minute_time = d.minute_time
AND m.currency = d.currency
""").show()
#+-----------+---+--------+------------+
#|minute_time|val|currency|someOtherVal|
#+-----------+---+--------+------------+
#| 00:01| 5| CAD| null|
#| 00:01| 5| USD| 12|
#| 00:00| 4| USD| 20|
#| 00:00| 4| CAD| 14|
#+-----------+---+--------+------------+
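To make the two steps easier to follow, here is a minimal plain-Python sketch of what the cross join followed by the left join computes on the same sample rows. It is an illustration only, not Spark code: the function name cross_then_left_join is hypothetical, and the dictionary lookup assumes each (minute_time, currency) pair appears at most once in data_df, which a real left join does not require.

```python
from itertools import product

# Same sample rows as in the question, as plain tuples.
minute_rows = [("00:00", 4), ("00:01", 5)]
data_rows = [("00:00", "USD", 20), ("00:01", "USD", 12), ("00:00", "CAD", 14)]


def cross_then_left_join(minute_rows, data_rows):
    """Return (minute_time, val, currency, someOtherVal) for every minute/currency pair.

    Hypothetical helper; assumes (minute_time, currency) is unique in data_rows.
    """
    # Step 1: SELECT DISTINCT currency FROM data_df (first-appearance order is kept).
    currencies = list(dict.fromkeys(currency for _, currency, _ in data_rows))
    # Lookup table for the left join: (minute_time, currency) -> someOtherVal.
    lookup = {(minute, currency): other for minute, currency, other in data_rows}
    # Step 2: CROSS JOIN minutes with currencies, then LEFT JOIN against the lookup.
    # A missing key yields None, which corresponds to NULL in the Spark output.
    return [
        (minute, val, currency, lookup.get((minute, currency)))
        for (minute, val), currency in product(minute_rows, currencies)
    ]


joined = cross_then_left_join(minute_rows, data_rows)
for row in joined:
    print(row)
```

The row for CAD at 00:01 exists only because the cross join produced that pair before the left join ran; joining minute_df to data_df directly would never create it.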