How to get Start and End date in PySpark?

I have a Spark dataframe (articleDF1), shown below. Using the Date column, I am trying to add two columns, Start and End date, to the dataframe and group the result by post_evar10. The final dataframe should have post_evar10, Start date and End date.

+----------+--------------------+
|      Date|         post_evar10|
+----------+--------------------+
|2019-09-02|www:/espanol/recu...|
|2019-09-02|www:/caregiving/h...|
|2019-12-15|www:/health/condi...|
|2019-09-01|www:/caregiving/h...|
|2019-08-31|www:/travel/trave...|
|2020-01-20|www:/home-family/...|
+----------+--------------------+

What I have tried:

from pyspark.sql import functions as f
articleDF3 = articleDF1.withColumn('Start_Date', f.min(f.col('Date'))).withColumn('Start_Date', f.max(f.col('Date'))).groupBy(f.col("post_evar10")).drop("Date")

Getting error: org.apache.spark.sql.AnalysisException: grouping expressions sequence is empty, and 'temp.ms_article_lifespan_final.Date' is not an aggregate function. Wrap '(min(temp.ms_article_lifespan_final.Date) AS Start_Date)' in windowing function(s) or wrap 'temp.ms_article_lifespan_final.Date' in first() (or first_value) if you don't care which value you get.;;
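
In other words, calling f.min/f.max inside withColumn turns the whole projection into an aggregation, while the remaining columns (Date, post_evar10) are neither grouped nor aggregated, so the analyzer complains. The two ways out are exactly what the message suggests: give min/max a window spec via .over(...), or move them into a real groupBy/agg. A minimal sketch of the latter, assuming articleDF1 and the columns shown above:

from pyspark.sql import functions as f

#one row per post_evar10 with the earliest and latest Date
articleDF3 = articleDF1.groupBy("post_evar10").\
    agg(f.min("Date").alias("Start_Date"),f.max("Date").alias("End_Date"))
articleDF3.show()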

Is this the result you are expecting?

To get the min and max over every row we can use a window function to compute min and max, then group by post_evar10 and take min and max again in the aggregation!

Example:

import sys
from pyspark.sql.window import Window
from pyspark.sql.functions import *

#Sample data
df=sc.parallelize([('2019-09-02','www:/espanol/r'),
                   ('2019-09-02','www:/caregiving/h'),
                   ('2019-12-15','www:/health/condi')]).\
    toDF(['Date','post_evar10']).\
    withColumn("Date",col("Date").cast("Date"))

#window on all rows
w = Window.orderBy("Date").rowsBetween(-sys.maxsize, sys.maxsize)
#or
w = Window.orderBy("Date").rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)

df.withColumn("min_Date",min("Date").over(w)).\ #get min value for Date
withColumn("max_Date",max("Date").over(w)).\ #get max value for Date
groupBy("post_evar10").\ #groupby on post_evar10
agg(min("min_Date").alias("Start_date"),max("max_Date").alias("End_date")).\ #get min,max
show()

#+-----------------+----------+----------+
#|      post_evar10|Start_date|  End_date|
#+-----------------+----------+----------+
#|   www:/espanol/r|2019-09-02|2019-12-15|
#|www:/caregiving/h|2019-09-02|2019-12-15|
#|www:/health/condi|2019-09-02|2019-12-15|
#+-----------------+----------+----------+
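
To see what the window step contributes before the groupBy collapses the rows, you can inspect the intermediate columns (same df and w as above, just a sanity check):

#every row carries the same min_Date/max_Date because w spans all rows
df.withColumn("min_Date",min("Date").over(w)).\
withColumn("max_Date",max("Date").over(w)).\
show()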

(or)

By using first, last functions over the window (w is ordered by Date ascending and spans all rows, so first() returns the earliest date and last() the latest):

df.withColumn("min_Date",first("Date").over(w)).\
withColumn("max_Date",last("Date").over(w)).\
groupBy("post_evar10").\
agg(min("min_Date").alias("Start_date"),max("max_Date").alias("End_date")).\
show()

To generate min and max for each unique post_evar10 value, a plain groupBy with aggregation is enough; the partitioned window below is only needed if you want to keep every row (see the sketch after the output):

w = Window.partitionBy('post_evar10').orderBy("Date").rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)

#Sample data with a repeated post_evar10 value
df=sc.parallelize([('2019-09-02','www:/espanol/r'),
                   ('2019-09-02','www:/caregiving/h'),
                   ('2019-09-03','www:/caregiving/h'),
                   ('2019-12-15','www:/health/condi')]).\
    toDF(['Date','post_evar10']).\
    withColumn("Date",col("Date").cast("Date"))

df.groupBy("post_evar10").\
agg(min("Date").alias("Start_date"),max("Date").alias("End_date")).\
show()

#+-----------------+----------+----------+
#|      post_evar10|Start_date|  End_date|
#+-----------------+----------+----------+
#|www:/health/condi|2019-12-15|2019-12-15|
#|   www:/espanol/r|2019-09-02|2019-09-02|
#|www:/caregiving/h|2019-09-02|2019-09-03|
#+-----------------+----------+----------+
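
If the goal is instead to keep every row and just attach the start and end dates per post_evar10, the partitioned window w defined above can be used directly; a sketch under that assumption:

#min/max computed per post_evar10 partition, no rows are collapsed
df.withColumn("Start_date",min("Date").over(w)).\
withColumn("End_date",max("Date").over(w)).\
show()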