Missing data when ordering Pyspark Window

Question

This is my current dataset:

from pyspark.sql import Window
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
import pyspark.sql.functions as psf

# Both rows share Source "1", matching the output shown below
df = spark.createDataFrame([("2","1",1),
                            ("3","1",2)],
                     schema = StructType([StructField("Data",  StringType()),
                                          StructField("Source",StringType()),
                                          StructField("Date",  IntegerType())]))


display(df.withColumn("Result",psf.collect_set("Data").over(Window.partitionBy("Source").orderBy("Date"))))

Output:

Data	Source	Date	Result
2	1	1	["2"]
3	1	2	["2","3"]

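Why is the value 3 missing from the first row of the Result column when I use the collect_set function over a Window that is ordered?

I also tried collect_list, but got the same result.

My desired output is: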

Data	Source	Date	Result
2	1	1	["2","3"]
3	1	2	["2","3"]

The order of the values in Result should be preserved: first the value for Date = 1, then the value for Date = 2.

Answer 1

You need to define the Window frame explicitly, from Window.unboundedPreceding to Window.unboundedFollowing:

Window.partitionBy("Source").orderBy("Date") \
  .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)

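By default, when the Window has an orderBy, Spark uses a growing frame that runs from Window.unboundedPreceding to Window.currentRow (strictly speaking a range frame, i.e. rangeBetween(Window.unboundedPreceding, Window.currentRow)). Each row therefore only sees the rows up to and including itself in the ordering, which is why the first row's Result contains only "2".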
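
To make the difference between the two frames concrete, the following plain-Python sketch emulates what each frame feeds into the aggregate for a single partition. It is only an illustration of the frame semantics, not how Spark is implemented; collect_over_frame is a hypothetical helper, and the rows are the two records from the question.

```python
# One partition (Source = "1") from the question's dataset.
rows = [{"Data": "2", "Date": 1}, {"Data": "3", "Date": 2}]


def collect_over_frame(rows, full_frame):
    """Hypothetical helper: emulate collecting Data over an ordered window."""
    ordered = sorted(rows, key=lambda r: r["Date"])
    collected = []
    for current in ordered:
        if full_frame:
            # unboundedPreceding .. unboundedFollowing: the whole partition
            frame = ordered
        else:
            # default with orderBy: everything up to the current row's Date
            frame = [r for r in ordered if r["Date"] <= current["Date"]]
        collected.append([r["Data"] for r in frame])
    return collected


default_result = collect_over_frame(rows, full_frame=False)
full_result = collect_over_frame(rows, full_frame=True)
print(default_result)
print(full_result)
```

The default frame reproduces the output seen in the question (the first row only sees itself), while the full frame gives every row the complete list.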

