cannot resolve '`Column_Name`' given input columns: Error: Pyspark Dataframes

Can someone help me implement the SQL logic below in a PySpark DataFrame?

    (SUM(Cash) / SUM(cash + credit)) * 100 AS Percentage,
        
    import pyspark.sql.functions as sf

    df1 = df.withColumn("cash_credit", sf.col("cash") + sf.col("credit"))
    df1.show(5)

    +------+------+---+----+-----------+
    |Credit|  Cash|MTH|  YR|cash_credit|
    +------+------+---+----+-----------+
    |100.00|400.00| 10|2019|     500.00|
    |  0.00|500.00|  6|2019|     500.00|
    |200.00|600.00| 12|2018|     800.00|
    |  0.00|  0.00| 10|2019|       0.00|
    |300.00|700.00|  7|2019|    1000.00|
    +------+------+---+----+-----------+

I tried the PySpark code below.

    df2 = df1.groupBy('MTH', 'YR').agg(sf.sum("Cash").alias("sum_Cash"))\
             .withColumn("final_column",sf.col("sum_Cash") + sf.col("cash_credit"))\
             .withColumn("div",sf.col("sum_Cash")/sf.col("final_column"))\
             .withColumn("Percentage",sf.col("div")*100)

But it fails to execute and shows the following error:

    cannot resolve '`cash_credit`' given input columns: [MTH, YR, sum_Cash];;
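
The error occurs because groupBy(...).agg(...) returns only the grouping keys plus the aggregated columns, so cash_credit is dropped from the result. A quick way to confirm this (a minimal check, reusing df1 from above):

    df1.groupBy('MTH', 'YR').agg(sf.sum("Cash").alias("sum_Cash")).columns
    # ['MTH', 'YR', 'sum_Cash']  -- cash_credit is gone after the aggregation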

You can modify the code like this to carry cash_credit through the groupBy aggregation:

    df2 = df1.groupBy('MTH', 'YR')\
             .agg(sf.sum("Cash").alias("sum_Cash"),
                  sf.sum("cash_credit").alias("cash_credit"))\
             .withColumn("final_column", sf.col("sum_Cash") + sf.col("cash_credit"))\
             .withColumn("div", sf.col("sum_Cash") / sf.col("final_column"))\
             .withColumn("Percentage", sf.col("div") * 100)

I used a sum aggregation for 'cash_credit' here, but you can substitute any other aggregate function that fits your data.
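
Putting everything together, here is a minimal, self-contained sketch. It assumes a local SparkSession and sample rows copied from the frame shown above, and it divides sum_Cash by the summed cash_credit directly, which matches the original SQL expression (note the final_column step above adds sum_Cash into the denominator, so its ratio differs from the SQL):

    import pyspark.sql.functions as sf
    from pyspark.sql import SparkSession

    # Hypothetical local session and sample rows matching the frame above.
    spark = SparkSession.builder.master("local[*]").getOrCreate()
    df = spark.createDataFrame(
        [(100.00, 400.00, 10, 2019),
         (0.00, 500.00, 6, 2019),
         (200.00, 600.00, 12, 2018),
         (0.00, 0.00, 10, 2019),
         (300.00, 700.00, 7, 2019)],
        ["Credit", "Cash", "MTH", "YR"])

    # Per-row denominator, then both sums in a single aggregation pass.
    df1 = df.withColumn("cash_credit", sf.col("Cash") + sf.col("Credit"))
    df2 = df1.groupBy("MTH", "YR")\
             .agg(sf.sum("Cash").alias("sum_Cash"),
                  sf.sum("cash_credit").alias("cash_credit"))\
             .withColumn("Percentage",
                         sf.col("sum_Cash") / sf.col("cash_credit") * 100)
    df2.show()

One thing to watch: if a group's cash_credit sums to 0, Spark returns null for the division instead of raising an error, so you may want to remap that case with sf.when before reporting the percentage.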