Divide 2 PySpark DataFrames Based on Column Names

Question

I have 2 DataFrames with hundreds of columns. Df1 looks like this:

id | col1 | col2 | col3 | ..... 
1     .2     .3     .3
2     .1     .4     .2
....

Df2 looks like this, with only 1 row of values:

col1 | col2 | col3 | ..... 
.2     .3     .3

I want to divide each row of Df1 by Df2, column by column, so I should get a result like this:

id | col1 | col2 | col3 | ..... 
1   .2/.2  .3/.3  .3/.3
2   .1/.2  .4/.3  .2/.3

With hundreds of columns, how can I do this without explicitly specifying the column names during the join? Thanks in advance!

Answer 1

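I took the values from df2 and zipped them with the columns of df1, then looped over the zipped pairs to compute each quotient. Note that this pairs columns by position, so the divisor columns must be in the same order as the df1 columns they apply to. Hope this helps. Here are the code snippet and the output I got.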

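from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()  # reuses the active session if one exists

df1 = spark.createDataFrame([('A', 2, 4), ('B', 6, 8), ('C', 10, 12)], ['col1', 'col2', 'col3'])
df2 = spark.createDataFrame([(2, 2)], ['div1', 'div2'])
df1.show()
df2.show()

# df2 has a single row, so bring it to the driver
divisors = df2.first()
# Skip the key column of df1 and pair each remaining column with the divisor in the same position
for c, v in zip(df1.columns[1:], divisors):
    df1 = df1.withColumn(c, col(c) / v)
df1.show()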

+----+----+----+
|col1|col2|col3|
+----+----+----+
|   A|   2|   4|
|   B|   6|   8|
|   C|  10|  12|
+----+----+----+

+----+----+
|div1|div2|
+----+----+
|   2|   2|
+----+----+

+----+----+----+
|col1|col2|col3|
+----+----+----+
|   A| 1.0| 2.0|
|   B| 3.0| 4.0|
|   C| 5.0| 6.0|
+----+----+----+

apache-spark

apache-spark-sql

pyspark