Running sum / cumulative sum with floor and ceiling in PySpark
I am new to Spark and I am trying to compute a windowed running sum with a floor of 0 and a ceiling (cap) of 8.
A toy example is given below (note that the real data is closer to a million rows):
import pyspark.sql.functions as F
from pyspark.sql import Window
import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType

pdf = pd.DataFrame({'aIds': [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
                    'day': [1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4],
                    'eCounts': [-3, 3, -6, 3, 3, 6, -3, -6, 3, 3, 3, -3]})
sdf = spark.createDataFrame(pdf)
sdf = sdf.orderBy(sdf.aIds, sdf.day)
This creates the table:
+----+---+-------+
|aIds|day|eCounts|
+----+---+-------+
| 1| 1| -3|
| 1| 2| 3|
| 1| 3| -6|
| 1| 4| 3|
| 2| 1| 3|
| 2| 2| 6|
| 2| 3| -3|
| 2| 4| -6|
| 3| 1| 3|
| 3| 2| 3|
| 3| 3| 3|
| 3| 4| -3|
+----+---+-------+
Below is an example of the result of performing the running sum, along with the expected output, runSumCap:
+----+---+-------+------+---------+
|aIds|day|eCounts|runSum|runSumCap|
+----+---+-------+------+---------+
| 1| 1| -3| -3| 0| <-- reset to 0
| 1| 2| 3| 0| 3|
| 1| 3| -6| -6| 0| <-- reset to 0
| 1| 4| 3| -3| 3|
| 2| 1| 3| 3| 3|
| 2| 2| 6| 9| 8| <-- reset to 8
| 2| 3| -3| 6| 5|
| 2| 4| -6| 0| 0| <-- reset to 0
| 3| 1| 3| 3| 3|
| 3| 2| 3| 6| 6|
| 3| 3| 3| 9| 8| <-- reset to 8
| 3| 4| -3| 6| 5|
+----+---+-------+------+---------+
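To state the rule in plain Python: it is just a running accumulation that is clamped to [0, 8] after every step. A tiny sketch for aIds = 2 only, purely to illustrate what I mean:

counts = [3, 6, -3, -6]                  # eCounts for aIds = 2, ordered by day
runSum, capped = 0, []
for c in counts:
    runSum = min(8, max(0, runSum + c))  # floor at 0, cap at 8
    capped.append(runSum)
print(capped)                            # [3, 8, 5, 0], matching runSumCap above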
I know I can compute the running sum with:
partition = Window.partitionBy('aIds').orderBy('aIds','day').rowsBetween(Window.unboundedPreceding, Window.currentRow)
sdf1 = sdf.withColumn('runSum',F.sum(sdf.eCounts).over(partition))
sdf1.orderBy('aIds','day').show()
To get the desired result, I tried looking at @pandas_udf to modify the sum:
@pandas_udf('double', PandasUDFType.GROUPED_AGG)
def runSumCap(counts):
    # the counts column is passed in as a pandas Series
    floor = 0
    cap = 8
    runSum = 0
    runSumList = []
    for count in counts.tolist():
        runSum = runSum + count
        if runSum > cap:
            runSum = cap
        elif runSum < floor:
            runSum = floor
        runSumList += [runSum]
    return pd.Series(runSumList)

partition = Window.partitionBy('aIds').orderBy('aIds','day').rowsBetween(Window.unboundedPreceding, Window.currentRow)
sdf1 = sdf.withColumn('runSum', runSumCap(sdf['eCounts']).over(partition))
But this does not work, and it does not seem like the most efficient approach.
How can I make this work? Is there a way to keep it parallel, or do I have to go to pandas DataFrames?
EDIT:
Added some clarification about the existing columns used to order the dataset, and some more insight into what I am trying to achieve.
EDIT 2:
The answer provided by @DrChess almost yields the correct result, but for some reason the series does not match up with the correct day:
+----+---+-------+------+
|aIds|day|eCounts|runSum|
+----+---+-------+------+
| 1| 1| -3| 0|
| 1| 2| 3| 0|
| 1| 3| -6| 3|
| 1| 4| 3| 3|
| 2| 1| 3| 3|
| 2| 2| 6| 8|
| 2| 3| -3| 0|
| 2| 4| -6| 5|
| 3| 1| 3| 6|
| 3| 2| 3| 3|
| 3| 3| 3| 8|
| 3| 4| -3| 5|
+----+---+-------+------+
Unfortunately, window functions with a pandas_udf of type GROUPED_AGG do not work with bounded window frames (.rowsBetween(Window.unboundedPreceding, Window.currentRow)). They currently only work with unbounded windows, i.e. .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing). Additionally, the input is a pandas.Series but the output is expected to be a single constant of the declared type, so you will not be able to achieve partial aggregations this way.
Instead, you can use a GROUPED_MAP pandas_udf, which works with df.groupBy().apply().
Here is some code:
@pandas_udf('aIds integer, day integer, eCounts integer, runSum integer', PandasUDFType.GROUPED_MAP)
def runSumCap(pdf):
    def _apply_on_series(counts):
        floor = 0
        cap = 8
        runSum = 0
        runSumList = []
        for count in counts.tolist():
            runSum = runSum + count
            if runSum > cap:
                runSum = cap
            elif runSum < floor:
                runSum = floor
            runSumList += [runSum]
        return pd.Series(runSumList)

    pdf.sort_values(by=['day'], inplace=True)
    pdf['runSum'] = _apply_on_series(pdf['eCounts'])
    return pdf

sdf1 = sdf.groupBy('aIds').apply(runSumCap)
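A side note, hedged: on Spark 3.0+ the GROUPED_MAP pandas_udf form is deprecated in favour of GroupedData.applyInPandas, which takes a plain Python function and a schema string. A minimal sketch of the same idea in that style (run_sum_cap is my own name; a plain list is assigned so the result lines up by position rather than by pandas index after the sort):

def run_sum_cap(pdf):
    floor, cap = 0, 8
    pdf = pdf.sort_values(by=['day'])
    runSum, out = 0, []
    for count in pdf['eCounts'].tolist():
        runSum = min(cap, max(floor, runSum + count))  # clamp after each step
        out.append(runSum)
    pdf['runSum'] = out   # plain list: assigned by position, not by index
    return pdf

sdf1 = sdf.groupBy('aIds').applyInPandas(
    run_sum_cap, schema='aIds long, day long, eCounts long, runSum long')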
I found a way to do this by first creating, in each row, an array (using collect_list as a window function) that contains the values used to compute the running sum up to that point.
I then defined a udf (I could not get it to work with pandas_udf) and this worked.
Below is the full reproducible example:
import pyspark.sql.functions as F
from pyspark.sql import Window
import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import *
import numpy as np

def accumulate(iterable):
    total = 0
    ceil = 8
    floor = 0
    for element in iterable:
        total = total + element
        if total > ceil:
            total = ceil
        elif total < floor:
            total = floor
    return total

pdf = pd.DataFrame({'aIds': [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
                    'day': [1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4],
                    'eCounts': [-3, 3, -6, 3, 3, 6, -3, -6, 3, 3, 3, -3]})
sdf = spark.createDataFrame(pdf)
sdf = sdf.orderBy(sdf.aIds, sdf.day)

runSumCap = F.udf(accumulate, LongType())
partition = Window.partitionBy('aIds').orderBy('aIds','day').rowsBetween(Window.unboundedPreceding, Window.currentRow)
sdf1 = sdf.withColumn('splitWindow', F.collect_list(sdf.eCounts).over(partition))
sdf2 = sdf1.withColumn('runSumCap', runSumCap(sdf1.splitWindow))
sdf2.orderBy('aIds','day').show()
This yields the expected result:
+----+---+-------+--------------+---------+
|aIds|day|eCounts| splitWindow|runSumCap|
+----+---+-------+--------------+---------+
| 1| 1| -3| [-3]| 0|
| 1| 2| 3| [-3, 3]| 3|
| 1| 3| -6| [-3, 3, -6]| 0|
| 1| 4| 3|[-3, 3, -6, 3]| 3|
| 2| 1| 3| [3]| 3|
| 2| 2| 6| [3, 6]| 8|
| 2| 3| -3| [3, 6, -3]| 5|
| 2| 4| -6|[3, 6, -3, -6]| 0|
| 3| 1| 3| [3]| 3|
| 3| 2| 3| [3, 3]| 6|
| 3| 3| 3| [3, 3, 3]| 8|
| 3| 4| -3| [3, 3, 3, -3]| 5|
+----+---+-------+--------------+---------+
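A follow-up note, not part of the original answer: on Spark 3.1+ the splitWindow array can be folded inside the JVM with the higher-order function F.aggregate, which avoids the Python UDF entirely. A sketch of that idea, reusing sdf1 from above:

sdf2 = sdf1.withColumn(
    'runSumCap',
    F.aggregate(
        'splitWindow',
        F.lit(0).cast('long'),   # accumulator starts at 0, cast to long to match eCounts
        lambda acc, x: F.least(F.lit(8), F.greatest(F.lit(0), acc + x))  # clamp each step
    )
)
sdf2.orderBy('aIds', 'day').show()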