PySpark: create a summary table with calculated values
I have a DataFrame that looks like this:
+--------------------+---------------------+-------------+------------+-----+
|tpep_pickup_datetime|tpep_dropoff_datetime|trip_distance|total_amount|isDay|
+--------------------+---------------------+-------------+------------+-----+
| 2019-01-01 09:01:00| 2019-01-01 08:53:20| 1.5| 2.00| true|
| 2019-01-01 21:59:59| 2019-01-01 21:18:59| 2.6| 5.00|false|
| 2019-01-01 10:01:00| 2019-01-01 08:53:20| 1.5| 2.00| true|
| 2019-01-01 22:59:59| 2019-01-01 21:18:59| 2.6| 5.00|false|
+--------------------+---------------------+-------------+------------+-----+
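I want to create a summary table that computes the trip_rate (the total_amount column divided by trip_distance) for all night trips and for all day trips. The final result should look like this: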
+------------+-----------+
| day_night | trip_rate |
+------------+-----------+
|Day | 1.33 |
|Night | 1.92 |
+------------+-----------+
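As a sanity check on those expected figures, here is a minimal plain-Python sketch of the arithmetic. It assumes trip_rate means the sum of total_amount divided by the sum of trip_distance within each group; the names `trips`, `totals` and `trip_rates` are only illustrative.

```python
from collections import defaultdict

# (trip_distance, total_amount, is_day) copied from the sample rows above
trips = [
    (1.5, 2.00, True),
    (2.6, 5.00, False),
    (1.5, 2.00, True),
    (2.6, 5.00, False),
]

# label -> [sum of total_amount, sum of trip_distance]
totals = defaultdict(lambda: [0.0, 0.0])
for distance, amount, is_day in trips:
    label = "Day" if is_day else "Night"
    totals[label][0] += amount
    totals[label][1] += distance

# Ratio of sums per group, rounded to two decimals
trip_rates = {label: round(amount / distance, 2)
              for label, (amount, distance) in totals.items()}
print(trip_rates)  # {'Day': 1.33, 'Night': 1.92}
```

Day works out to 4.00 / 3.0 and Night to 10.00 / 5.2, which matches the table.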
This is what I am trying to do:
df2 = spark.createDataFrame(
[
('2019-01-01 09:01:00','2019-01-01 08:53:20','1.5','2.00','true'),#day
('2019-01-01 21:59:59','2019-01-01 21:18:59','2.6','5.00','false'),#night
('2019-01-01 10:01:00','2019-01-01 08:53:20','1.5','2.00','true'),#day
('2019-01-01 22:59:59','2019-01-01 21:18:59','2.6','5.00','false'),#night
],
['tpep_pickup_datetime','tpep_dropoff_datetime','trip_distance','total_amount','day_night'] # add your columns label here
)
day_trip_rate = df2.where(df2.day_night == 'Day').withColumn("trip_rate",F.sum("total_amount")/F.sum("trip_distance"))
night_trip_rate = df2.where(df2.day_night == 'Night').withColumn("trip_rate",F.sum("total_amount")/F.sum("trip_distance"))
I am not even sure my approach is right. I get this error:
raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: "grouping expressions sequence is empty, and 'tpep_pickup_datetime' is not an aggregate function.
Can anyone help me figure out how to get the summary table?
from pyspark.sql import functions as F
from pyspark.sql.functions import *
df2.groupBy("day_night").agg(F.round(F.sum("total_amount")/F.sum("trip_distance"),2).alias('trip_rate'))\
.withColumn("day_night", F.when(col("day_night")=="true", "Day").otherwise("Night")).show()
+---------+---------+
|day_night|trip_rate|
+---------+---------+
| Day| 1.33|
| Night| 1.92|
+---------+---------+
Without rounding:
# alias() must be applied to the aggregate column inside agg(), not to the resulting DataFrame
df2.groupBy("day_night").agg((F.sum("total_amount")/F.sum("trip_distance")).alias('trip_rate'))\
.withColumn("day_night", F.when(col("day_night")=="true", "Day").otherwise("Night")).show()
(Your df2 construction code uses day_night, but the displayed table uses isDay. I am assuming the field name is day_night here.)