PySpark: get all dataframe columns defined as values into another column

Question

I'm new to PySpark and I don't know what's wrong with my code. I have 2 dataframes:

df1= 
+---+--------------+
| id|No_of_Question|
+---+--------------+
|  1|            Q1|
|  2|            Q4|
|  3|           Q23|
|...|           ...|
+---+--------------+

df2 = 
+---+---+---+---+---+-----+---+---+---+---+
| Q1| Q2| Q3| Q4| Q5| ... |Q22|Q23|Q24|Q25|
+---+---+---+---+---+-----+---+---+---+---+
|  1|  0|  1|  0|  0| ... |  1|  1|  1|  1|
+---+---+---+---+---+-----+---+---+---+---+

I want to create a new dataframe that contains only the columns of df2 whose names appear as values in df1.No_of_Question.

Expected result

df2 = 
+------------+
| Q1| Q4| Q24|
+------------+
|  1|  0|   1|
+------------+

I've already tried

df2 = df2.select(*F.collect_list(df1.No_of_Question)) #Error: Column is not iterable

or

df2 = df2.select(F.collect_list(df1.No_of_Question)) #Error: Resolved attribute(s) No_of_Question#1791 missing from Q1, Q2...

or

df2 = df2.select(*df1.No_of_Question)

and

df2= df2.select([col for col in df2.columns if col in df1.No_of_Question])

But none of these solutions worked. Could you help me, please?

Answer 1

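df1.No_of_Question is a Column expression, not a list of values, so it cannot be iterated or unpacked on the driver. You can collect the values of No_of_Question into a Python list and then pass that list to df2.select().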

Try this:

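from pyspark.sql import functions as F

# Collect the names stored in df1 to the driver and turn each into a Column.
questions = [
    F.col(r.No_of_Question).alias(r.No_of_Question)
    for r in df1.select("No_of_Question").collect()
]

# Keep only those columns of df2.
df2 = df2.select(*questions)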
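
If df1 might list a name that df2 does not have, or list the same name twice, the select above would fail or return a repeated column. A minimal sketch of a more defensive variant follows. The helper names existing_columns and select_questions are hypothetical, not part of PySpark, and the sketch assumes the list of names is small enough to collect to the driver.

```python
def existing_columns(wanted, available):
    """Return the names in `wanted` that also occur in `available`,
    keeping the order of `wanted` and dropping duplicates."""
    available_set = set(available)
    seen = set()
    result = []
    for name in wanted:
        if name in available_set and name not in seen:
            seen.add(name)
            result.append(name)
    return result


def select_questions(df1, df2, name_col="No_of_Question"):
    """Select from df2 only the columns whose names are stored in df1[name_col].

    df1 and df2 are expected to be PySpark DataFrames (assumption).
    """
    # collect() returns Row objects; row[name_col] reads the value by field name.
    wanted = [row[name_col] for row in df1.select(name_col).collect()]
    # DataFrame.select accepts plain column-name strings.
    return df2.select(*existing_columns(wanted, df2.columns))
```

With the dataframes from the question, this would be called as df2 = select_questions(df1, df2). The comparison against df2.columns is case-sensitive, so names must match exactly.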
