Join list column with string column in PySpark
I have two DataFrames, df_emp and df_dept.
df_emp:
id Name
1 aaa
2 bbb
3 ccc
4 ddd
df_dept:
dept_id dept_name employees
1 DE [1, 2]
2 DA [3, 4]
Expected result after the join:
dept_name employees employee_names
DE [1, 2] [aaa, bbb]
DA [3, 4] [ccc, ddd]
Any idea how to achieve this with a simple join or a UDF?
You can do this without a UDF. First explode the array, then join and group by department.
Input data:
from pyspark.sql import functions as F

df_emp = spark.createDataFrame(
    [(1, 'aaa'),
     (2, 'bbb'),
     (3, 'ccc'),
     (4, 'ddd')],
    ['id', 'Name']
)
df_dept = spark.createDataFrame(
    [(1, 'DE', [1, 2]),
     (2, 'DA', [3, 4])],
    ['dept_id', 'dept_name', 'employees']
)
Script:
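# One row per (department, employee id) pair
df_dept_exploded = df_dept.withColumn('id', F.explode('employees'))
# Left join keeps ids that have no match in df_emp (their Name is null)
df_joined = df_dept_exploded.join(df_emp, 'id', 'left')
# Collect the ids and names back into arrays, one row per department
df = (
    df_joined
    .groupBy('dept_name')
    .agg(
        F.collect_list('id').alias('employees'),
        F.collect_list('Name').alias('employee_names')
    )
)
df.show()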
# +---------+---------+--------------+
# |dept_name|employees|employee_names|
# +---------+---------+--------------+
# |       DE|   [1, 2]|    [aaa, bbb]|
# |       DA|   [3, 4]|    [ccc, ddd]|
# +---------+---------+--------------+