Spark: getting the first entry according to a date groupBy
Is it possible to get the first Datetime of each day from a DataFrame?
Schema:
root
|-- Datetime: timestamp (nullable = true)
|-- Quantity: integer (nullable = true)
+-------------------+--------+
|           Datetime|Quantity|
+-------------------+--------+
|2021-09-10 10:08:11|     200|
|2021-09-10 10:08:16|     100|
|2021-09-11 10:05:11|     100|
|2021-09-11 10:07:25|     100|
|2021-09-11 10:07:14|    3000|
|2021-09-12 09:24:11|    1000|
+-------------------+--------+
Desired output:
+-------------------+--------+
|           Datetime|Quantity|
+-------------------+--------+
|2021-09-10 10:08:11|     200|
|2021-09-11 10:05:11|     100|
|2021-09-12 09:24:11|    1000|
+-------------------+--------+
You can use row_number. Define a Window partitioned by day and ordered by Datetime, then keep only the row numbered 1 in each partition:
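from pyspark.sql import functions as F, Window
# One partition per calendar day, rows sorted from earliest to latest timestamp
w = Window.partitionBy(F.to_date("Datetime")).orderBy("Datetime")
# Number the rows inside each partition, keep the first, then drop the helper column
df1 = df.withColumn("rn", F.row_number().over(w)).filter("rn = 1").drop("rn")
df1.show()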
#+-------------------+--------+
#|           Datetime|Quantity|
#+-------------------+--------+
#|2021-09-10 10:08:11|     200|
#|2021-09-11 10:05:11|     100|
#|2021-09-12 09:24:11|    1000|
#+-------------------+--------+
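To see what the window is doing, it may help to reproduce the same logic without Spark. The following is only a plain-Python sketch of the semantics, not Spark code: the list `rows` and the helper `first_per_day` are hypothetical names introduced here, and the data is copied from the question.

```python
from datetime import datetime
from itertools import groupby

# Sample data copied from the question: (Datetime, Quantity)
rows = [
    (datetime(2021, 9, 10, 10, 8, 11), 200),
    (datetime(2021, 9, 10, 10, 8, 16), 100),
    (datetime(2021, 9, 11, 10, 5, 11), 100),
    (datetime(2021, 9, 11, 10, 7, 25), 100),
    (datetime(2021, 9, 11, 10, 7, 14), 3000),
    (datetime(2021, 9, 12, 9, 24, 11), 1000),
]


def first_per_day(rows):
    """Hypothetical helper mirroring the window logic in plain Python."""
    # Counterpart of orderBy("Datetime"): earliest timestamps first
    ordered = sorted(rows, key=lambda r: r[0])
    result = []
    # Counterpart of partitionBy(to_date("Datetime")): one group per calendar day
    for _day, group in groupby(ordered, key=lambda r: r[0].date()):
        # Counterpart of filtering on row_number() == 1: keep the first row of the group
        result.append(next(group))
    return result


first_rows = first_per_day(rows)
for dt, qty in first_rows:
    print(dt, qty)
```

Note that the whole row is kept, not just the minimum Datetime, which is why Quantity comes along with it. If two rows in the same day share the earliest timestamp, row_number still keeps only one of them, and which one is not guaranteed.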