Pyspark dataframe: Create a new numeric column from a string column and calculate average
I have a pyspark dataframe like the input data below. The subject_score column is of string type. First, I want to cast the string column to an integer type; the desired result is shown in Output 1.
Next, I want to calculate the average in a new numeric column, avg_subject_score (I don't want to cast the existing column in place). The desired new column is shown in Output 2.
Then, add a column (named "grade", StringType()) to this new dataframe holding the grouped averages, containing the string "Good" if the average score is between 50 and 99, "Very Good" if the average score is above 100, and "Fail" if it is below 50. The desired result is shown in Output 3.
+------+-------------------+
|ID |subject_score |
+------+-------------------+
|123456|100 |
|123456|50 |
|123456|0 |
|789292|200 |
|789292|200 |
|789292|100 |
|239000|50 |
|239000|100 |
|239000|NA |
|239000|NA |
+------+-------------------+
Output 1 - without NA
+------+-------------------+
|ID |converted_score |
+------+-------------------+
|123456|100 |
|123456|50 |
|123456|0 |
|789292|200 |
|789292|200 |
|789292|100 |
|239000|50 |
|239000|100 |
+------+-------------------+
Output 2
+------+-------------------+
|ID |avg_subject_score |
+------+-------------------+
|123456|50 |
|789292|167 |
|239000|38 |
+------+-------------------+
Output 3
+------+-------------------+-------------+
|id |avg_subject_score |grade |
+------+-------------------+-------------+
|123456|50 |Good |
|789292|167 |Very Good |
|239000|38 |Fail |
+------+-------------------+-------------+
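For reference, a minimal sketch of how the sample input could be built for testing (the DataFrame name df2 and the string-typed columns are assumptions chosen to match the answer below):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
data = [("123456", "100"), ("123456", "50"), ("123456", "0"),
        ("789292", "200"), ("789292", "200"), ("789292", "100"),
        ("239000", "50"), ("239000", "100"), ("239000", "NA"), ("239000", "NA")]
# both columns start out as strings, matching the input table above
df2 = spark.createDataFrame(data, ["ID", "subject_score"])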
This looks more like a homework question, since you should try exploring the basic functions yourself.
But just for the sake of it:
First, let's create a new column with the desired data type (casting the string to long turns non-numeric values such as "NA" into null, and dropna then removes those rows, giving Output 1):
from pyspark.sql.functions import col, avg, when
dfv=df2.withColumn("converted_score",col("subject_score").cast("long")).drop("subject_score").dropna(how="all",subset=["converted_score"])
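If you want to double-check the conversion, something like the following should show a long-typed converted_score column with no rows left for the "NA" values:

dfv.printSchema()
dfv.show(truncate=False)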
Now group by ID and compute the average:
dfv=dfv.groupBy("ID").agg(avg("converted_score").alias("avg_subject_score"))
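Note that avg produces a decimal (roughly 166.67 for 789292), while Output 2 shows whole numbers; also, with the NA rows dropped, 239000 averages to 75 rather than the 38 shown in Output 2 (38 only comes out if the NA rows are counted as 0). If rounded whole numbers are what you want, one possible variation is to replace the line above with something like:

from pyspark.sql.functions import round as spark_round
# round the average and cast it to a whole number
dfv=dfv.groupBy("ID").agg(spark_round(avg("converted_score")).cast("long").alias("avg_subject_score"))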
Now you can add the last column using when:
dfv=dfv.withColumn("grade",when(col("avg_subject_score")>100,"Very Good").when(col("avg_subject_score")>=50,"Good").otherwise("Fail"))
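A quick look at the final result (the grade labels here follow the capitalization used in Output 3):

dfv.show(truncate=False)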