Join on items inside an array column in a PySpark DataFrame

How can I get df3 from df1 and df2? A row of df2 should match a row of df1 when df2.b appears in the list of items in df1.b.
----------------        --------------          -------------------------------
| a |    b     |        | b  |   c   |          | a |    b    |      c        |       
----------------        --------------     =>   -------------------------------
| 2 | [3,4]    |        | 3  | Three |          | 2 |  [3, 4] | [Three, Four] |
| 3 | [4]      |        | 4  | Four  |          | 3 |  [4]    | [Four]        |
----------------        --------------          -------------------------------
  df1                         df2                            df3

Join with array_contains in the join condition, then group by a and apply collect_list to column c:

import pyspark.sql.functions as F

df1 = spark.createDataFrame([(2, [3, 4]), (3, [4])], ["a", "b"])
df2 = spark.createDataFrame([(3, "Three"), (4, "Four")], ["b", "c"])

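# Match each df1 row with every df2 row whose b value appears in df1's array column b.
df3 = df1.alias("df1").join(
    df2.alias("df2"),
    F.expr("array_contains(df1.b, df2.b)"),
    "left"  # keep df1 rows even when nothing in df2 matches
).groupBy("df1.a").agg(
    F.first("df1.b").alias("b"),         # assumes a is unique in df1, so b is constant per group
    F.collect_list("df2.c").alias("c")   # gather the matched c values into a list
)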

df3.show()
#+---+------+-------------+
#|  a|     b|            c|
#+---+------+-------------+
#|  2|[3, 4]|[Three, Four]|
#|  3|   [4]|       [Four]|
#+---+------+-------------+
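
To make the logic of the join and the aggregation easier to follow, here is a minimal plain-Python sketch of the same computation on ordinary lists, with no Spark involved. The names join_on_array_items, df1_rows, df2_rows and df3_rows are hypothetical and exist only for this illustration; the membership test stands in for array_contains and the inner list stands in for collect_list.

```python
# Plain-Python illustration of the join-then-group logic (not Spark code).
# All names here are hypothetical and used only for this sketch.
df1_rows = [(2, [3, 4]), (3, [4])]        # (a, b), where b is a list of keys
df2_rows = [(3, "Three"), (4, "Four")]    # (b, c)


def join_on_array_items(left, right):
    """For each left row, collect c from every right row whose b is in the left list."""
    result = []
    for a, b_items in left:
        # "b in b_items" plays the role of array_contains(df1.b, df2.b);
        # building the list plays the role of collect_list(df2.c).
        c_values = [c for b, c in right if b in b_items]
        # A left row with no match keeps an empty list, like the left join above.
        result.append((a, b_items, c_values))
    return result


df3_rows = join_on_array_items(df1_rows, df2_rows)
print(df3_rows)
```

In this sketch the c values come out in the order of df2_rows. In Spark, collect_list does not guarantee the order of the collected items after a shuffle, so if the order of c matters it is safer to sort explicitly.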