How to get max(date) from a given set of data grouped by some fields using PySpark?
I have data in a DataFrame as below:

datetime             | userId | memberId | value
2016-04-06 16:36:... | 1234   | 111      | 1
2016-04-06 17:35:... | 1234   | 222      | 5
2016-04-06 17:50:... | 1234   | 111      | 8
2016-04-06 18:36:... | 1234   | 222      | 9
2016-04-05 16:36:... | 4567   | 111      | 1
2016-04-06 17:35:... | 4567   | 222      | 5
2016-04-06 18:50:... | 4567   | 111      | 8
2016-04-06 19:36:... | 4567   | 222      | 9
I need to find max(datetime) grouped by userId, memberId. When I tried the following:
df2 = df.groupBy('userId','memberId').max('datetime')
I got this error:
org.apache.spark.sql.AnalysisException: "datetime" is not a numeric
column. Aggregation function can only be applied on a numeric column.;
The output I want is as follows:
userId | memberId | datetime
1234   | 111      | 2016-04-06 17:50:...
1234   | 222      | 2016-04-06 18:36:...
4567   | 111      | 2016-04-06 18:50:...
4567   | 222      | 2016-04-06 19:36:...
Can someone please help me get the max date from the given data using PySpark DataFrames?
For non-numeric but Orderable types you can use agg with max directly:
from pyspark.sql.functions import col, max as max_

df = sc.parallelize([
    ("2016-04-06 16:36", 1234, 111, 1),
    ("2016-04-06 17:35", 1234, 111, 5),
]).toDF(["datetime", "userId", "memberId", "value"])

# Cast the string to timestamp so max orders the values chronologically,
# then aggregate per (userId, memberId) group.
(df.withColumn("datetime", col("datetime").cast("timestamp"))
    .groupBy("userId", "memberId")
    .agg(max_("datetime")))
## +------+--------+--------------------+
## |userId|memberId| max(datetime)|
## +------+--------+--------------------+
## | 1234| 111|2016-04-06 17:35:...|
## +------+--------+--------------------+
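If you also want the result column to be named datetime, as in the desired output (rather than max(datetime)), you can attach an alias to the aggregate. Below is a minimal sketch over the full sample data, assuming a SparkSession is available as spark; the seconds in the timestamps are placeholder values, since the question truncates them:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, max as max_

spark = SparkSession.builder.getOrCreate()

# Sample rows from the question; the ":00" seconds are hypothetical fill-ins.
df = spark.createDataFrame([
    ("2016-04-06 16:36:00", 1234, 111, 1),
    ("2016-04-06 17:35:00", 1234, 222, 5),
    ("2016-04-06 17:50:00", 1234, 111, 8),
    ("2016-04-06 18:36:00", 1234, 222, 9),
    ("2016-04-05 16:36:00", 4567, 111, 1),
    ("2016-04-06 17:35:00", 4567, 222, 5),
    ("2016-04-06 18:50:00", 4567, 111, 8),
    ("2016-04-06 19:36:00", 4567, 222, 9),
], ["datetime", "userId", "memberId", "value"])

result = (df
    .withColumn("datetime", col("datetime").cast("timestamp"))
    .groupBy("userId", "memberId")
    # alias names the output column "datetime" instead of "max(datetime)"
    .agg(max_("datetime").alias("datetime")))

result.show()

If you need the whole row that carries the max datetime (for example, its value column as well), the usual alternative is a window partitioned by userId and memberId, ordered by datetime descending, keeping only the first row of each partition.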