应用 Window 函数来计算 pySpark 中的差异
Applying a Window function to calculate differences in pySpark
我正在使用 pySpark
,并使用代表每日资产价格的两列设置我的数据框,如下所示:
ind = sc.parallelize(range(1,5))
prices = sc.parallelize([33.3,31.1,51.2,21.3])
data = ind.zip(prices)
df = sqlCtx.createDataFrame(data,["day","price"])
我申请了 df.show()
:
+---+-----+
|day|price|
+---+-----+
| 1| 33.3|
| 2| 31.1|
| 3| 51.2|
| 4| 21.3|
+---+-----+
这很好。我想要另一列包含价格列的日常 returns,即
(price(day2)-price(day1))/(price(day1))
经过大量研究,我被告知这是通过应用 pyspark.sql.window
函数最有效地实现的,但我看不出如何实现。
您可以使用 lag 函数引入前一天的列,并从两列中添加额外的实际日常 return 列,但您可能必须告诉 spark如何对数据进行分区 and/or 让它做延迟,像这样:
from pyspark.sql.window import Window
import pyspark.sql.functions as func
from pyspark.sql.functions import lit
dfu = df.withColumn('user', lit('tmoore'))
df_lag = dfu.withColumn('prev_day_price',
func.lag(dfu['price'])
.over(Window.partitionBy("user")))
result = df_lag.withColumn('daily_return',
(df_lag['price'] - df_lag['prev_day_price']) / df_lag['price'] )
>>> result.show()
+---+-----+-------+--------------+--------------------+
|day|price| user|prev_day_price| daily_return|
+---+-----+-------+--------------+--------------------+
| 1| 33.3| tmoore| null| null|
| 2| 31.1| tmoore| 33.3|-0.07073954983922816|
| 3| 51.2| tmoore| 31.1| 0.392578125|
| 4| 21.3| tmoore| 51.2| -1.403755868544601|
+---+-----+-------+--------------+--------------------+
这里是对 Window functions in Spark 的更长介绍。
Lag 函数可以帮助您解决用例。
from pyspark.sql.window import Window
import pyspark.sql.functions as func
### Defining the window
Windowspec=Window.orderBy("day")
### Calculating lag of price at each day level
prev_day_price= df.withColumn('prev_day_price',
func.lag(dfu['price'])
.over(Windowspec))
### Calculating the average
result = prev_day_price.withColumn('daily_return',
(prev_day_price['price'] - prev_day_price['prev_day_price']) /
prev_day_price['price'] )
我正在使用 pySpark
,并使用代表每日资产价格的两列设置我的数据框,如下所示:
ind = sc.parallelize(range(1,5))
prices = sc.parallelize([33.3,31.1,51.2,21.3])
data = ind.zip(prices)
df = sqlCtx.createDataFrame(data,["day","price"])
我申请了 df.show()
:
+---+-----+
|day|price|
+---+-----+
| 1| 33.3|
| 2| 31.1|
| 3| 51.2|
| 4| 21.3|
+---+-----+
这很好。我想要另一列包含价格列的日常 returns,即
(price(day2)-price(day1))/(price(day1))
经过大量研究,我被告知这是通过应用 pyspark.sql.window
函数最有效地实现的,但我看不出如何实现。
您可以使用 lag 函数引入前一天的列,并从两列中添加额外的实际日常 return 列,但您可能必须告诉 spark如何对数据进行分区 and/or 让它做延迟,像这样:
from pyspark.sql.window import Window
import pyspark.sql.functions as func
from pyspark.sql.functions import lit
dfu = df.withColumn('user', lit('tmoore'))
df_lag = dfu.withColumn('prev_day_price',
func.lag(dfu['price'])
.over(Window.partitionBy("user")))
result = df_lag.withColumn('daily_return',
(df_lag['price'] - df_lag['prev_day_price']) / df_lag['price'] )
>>> result.show()
+---+-----+-------+--------------+--------------------+
|day|price| user|prev_day_price| daily_return|
+---+-----+-------+--------------+--------------------+
| 1| 33.3| tmoore| null| null|
| 2| 31.1| tmoore| 33.3|-0.07073954983922816|
| 3| 51.2| tmoore| 31.1| 0.392578125|
| 4| 21.3| tmoore| 51.2| -1.403755868544601|
+---+-----+-------+--------------+--------------------+
这里是对 Window functions in Spark 的更长介绍。
Lag 函数可以帮助您解决用例。
from pyspark.sql.window import Window
import pyspark.sql.functions as func
### Defining the window
Windowspec=Window.orderBy("day")
### Calculating lag of price at each day level
prev_day_price= df.withColumn('prev_day_price',
func.lag(dfu['price'])
.over(Windowspec))
### Calculating the average
result = prev_day_price.withColumn('daily_return',
(prev_day_price['price'] - prev_day_price['prev_day_price']) /
prev_day_price['price'] )