PySpark DataFrame: create a new numeric column from a string column and calculate the average

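I have a PySpark DataFrame like the input data below. The subject_score column is of string type. I first want to convert it to an integer type; the desired result is shown in Output 1.

I then want to compute the average in a new numeric column, avg_subject_score, without casting the existing column in place. The desired new column is shown in Output 2.

Finally, I want to add a column named "grade" (StringType()) to this new DataFrame (the one containing the grouped averages). It should contain "Good" if the average score is between 50 and 99, "Very Good" if the average score is above 100, and "Fail" if the average score is below 50. The desired result is shown in Output 3.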

+------+-------------------+
|ID    |subject_score      |
+------+-------------------+
|123456|100                |
|123456|50                 |
|123456|0                  |
|789292|200                |
|789292|200                |
|789292|100                |
|239000|50                 |
|239000|100                |
|239000|NA                 |
|239000|NA                 |
+------+-------------------+



Output 1 - without NA 
+------+-------------------+
|ID    |converted_score    |
+------+-------------------+
|123456|100                |
|123456|50                 |
|123456|0                  |
|789292|200                |
|789292|200                |
|789292|100                |
|239000|50                 |
|239000|100                |
+------+-------------------+

Output 2
+------+-------------------+
|ID    |avg_subject_score  |
+------+-------------------+
|123456|50                 |
|789292|167                |
|239000|38                 |
+------+-------------------+


Output 3
+------+-------------------+-------------+
|id    |avg_subject_score  |grade        |
+------+-------------------+-------------+
|123456|50                 |Good         |
|789292|167                |Very Good    |
|239000|38                 |Fail         |
+------+-------------------+-------------+

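This looks more like a homework assignment, so you should try exploring the basic functions yourself.

But just for the sake of it: first, let's create the new column with the required data type.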

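from pyspark.sql.functions import avg, col, when

# With ANSI mode off, casting "NA" to long yields null; dropna then removes those rows
dfv = df2.withColumn("converted_score", col("subject_score").cast("long")).drop("subject_score").dropna(how="all", subset=["converted_score"])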

Now group by ID and compute the average.

dfv=dfv.groupBy("ID").agg(avg("converted_score").alias("avg_subject_score"))
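
One detail worth checking by hand: with the NA rows dropped as in the step above, ID 239000 averages 75, whereas Output 2 in the question shows 38, which is what you get if the two NA rows are counted as 0 (37.5, rounded). A small plain-Python sketch of that arithmetic, with no Spark involved; the `group_average` helper and its `na_as_zero` flag are made up for illustration only:

```python
from collections import defaultdict

# Same rows as the input table in the question
rows = [
    ("123456", "100"), ("123456", "50"), ("123456", "0"),
    ("789292", "200"), ("789292", "200"), ("789292", "100"),
    ("239000", "50"), ("239000", "100"), ("239000", "NA"), ("239000", "NA"),
]

def group_average(rows, na_as_zero):
    """Average score per ID; non-numeric rows are either skipped or counted as 0."""
    scores = defaultdict(list)
    for id_, raw in rows:
        if raw.isdigit():
            scores[id_].append(int(raw))
        elif na_as_zero:
            scores[id_].append(0)
    return {id_: sum(vals) / len(vals) for id_, vals in scores.items()}

dropped = group_average(rows, na_as_zero=False)  # what dropna does
zeroed = group_average(rows, na_as_zero=True)    # what Output 2 seems to assume

print(dropped["239000"])  # 75.0
print(zeroed["239000"])   # 37.5
```

The other two IDs have no NA rows, so both treatments agree for them. Which one you want depends on whether a missing score should count against the student.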

Now you can use when() to add the last column.

# >= 50 so that an average of exactly 50 is graded "Good", as in Output 3
dfv = dfv.withColumn("grade", when(col("avg_subject_score") > 100, "Very Good").when(col("avg_subject_score") >= 50, "Good").otherwise("Fail"))