如何计算每行给定索引前后的行平均值-pyspark?
how to calculate row mean before and after a given index for each row - pyspark?
我有一个包含多列的数据框和一个索引,我必须计算索引前后这些列的平均值。
这是我的 pandas 代码:
for i in range(len(res.index)):
i=int(i)
m=int(res['index'].ix[i])
n = len(res.columns[1:m])
if n == 0:
res['mean'].ix[i]=0
else:
res['mean'].ix[i]=int(res.ix[i,1:m].sum()) / n
我想在 pyspark 中完成?
请帮忙!!
您可以使用 pyspark
中的 UDF
进行计算。这是一个例子:-
from pyspark.sql import functions as F
from pyspark.sql import types as T
import numpy as np
sample_data = sqlContext.createDataFrame([
range(10)+[4],
range(50, 60)+[2],
range(9, 19)+[4],
range(19, 29)+[3],
], ["col_"+str(i) for i in range(10)]+["index"])
sample_data.show()
+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
|col_0|col_1|col_2|col_3|col_4|col_5|col_6|col_7|col_8|col_9|index|
+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
| 0| 1| 2| 3| 4| 5| 6| 7| 8| 9| 4|
| 50| 51| 52| 53| 54| 55| 56| 57| 58| 59| 2|
| 9| 10| 11| 12| 13| 14| 15| 16| 17| 18| 4|
| 19| 20| 21| 22| 23| 24| 25| 26| 27| 28| 3|
+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
def def_mn(data, index, mean="pre"):
if mean == "pre":
return sum(data[:index])/float(len(data[:index]))
elif mean == "post":
return sum(data[index:])/float(len(data[index:]))
mn_udf = F.udf(def_mn)
sample_data.withColumn(
"index_pre_mean",
mn_udf(F.array([cl for cl in sample_data.columns[:-1]]), "index")
).withColumn(
"index_post_mean",
mn_udf(F.array([cl for cl in sample_data.columns[:-1]]), "index", F.lit("post"))
).show()
+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+--------------+---------------+
|col_0|col_1|col_2|col_3|col_4|col_5|col_6|col_7|col_8|col_9|index|index_pre_mean|index_post_mean|
+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+--------------+---------------+
|0 |1 |2 |3 |4 |5 |6 |7 |8 |9 |4 |1.5 |6.5 |
|50 |51 |52 |53 |54 |55 |56 |57 |58 |59 |2 |50.5 |55.5 |
|9 |10 |11 |12 |13 |14 |15 |16 |17 |18 |4 |10.5 |15.5 |
|19 |20 |21 |22 |23 |24 |25 |26 |27 |28 |3 |20.0 |25.0 |
+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+--------------+---------------+
我有一个包含多列的数据框和一个索引,我必须计算索引前后这些列的平均值。
这是我的 pandas 代码:
for i in range(len(res.index)):
i=int(i)
m=int(res['index'].ix[i])
n = len(res.columns[1:m])
if n == 0:
res['mean'].ix[i]=0
else:
res['mean'].ix[i]=int(res.ix[i,1:m].sum()) / n
我想在 pyspark 中完成? 请帮忙!!
您可以使用 pyspark
中的 UDF
进行计算。这是一个例子:-
from pyspark.sql import functions as F
from pyspark.sql import types as T
import numpy as np
sample_data = sqlContext.createDataFrame([
range(10)+[4],
range(50, 60)+[2],
range(9, 19)+[4],
range(19, 29)+[3],
], ["col_"+str(i) for i in range(10)]+["index"])
sample_data.show()
+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
|col_0|col_1|col_2|col_3|col_4|col_5|col_6|col_7|col_8|col_9|index|
+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
| 0| 1| 2| 3| 4| 5| 6| 7| 8| 9| 4|
| 50| 51| 52| 53| 54| 55| 56| 57| 58| 59| 2|
| 9| 10| 11| 12| 13| 14| 15| 16| 17| 18| 4|
| 19| 20| 21| 22| 23| 24| 25| 26| 27| 28| 3|
+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
def def_mn(data, index, mean="pre"):
if mean == "pre":
return sum(data[:index])/float(len(data[:index]))
elif mean == "post":
return sum(data[index:])/float(len(data[index:]))
mn_udf = F.udf(def_mn)
sample_data.withColumn(
"index_pre_mean",
mn_udf(F.array([cl for cl in sample_data.columns[:-1]]), "index")
).withColumn(
"index_post_mean",
mn_udf(F.array([cl for cl in sample_data.columns[:-1]]), "index", F.lit("post"))
).show()
+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+--------------+---------------+
|col_0|col_1|col_2|col_3|col_4|col_5|col_6|col_7|col_8|col_9|index|index_pre_mean|index_post_mean|
+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+--------------+---------------+
|0 |1 |2 |3 |4 |5 |6 |7 |8 |9 |4 |1.5 |6.5 |
|50 |51 |52 |53 |54 |55 |56 |57 |58 |59 |2 |50.5 |55.5 |
|9 |10 |11 |12 |13 |14 |15 |16 |17 |18 |4 |10.5 |15.5 |
|19 |20 |21 |22 |23 |24 |25 |26 |27 |28 |3 |20.0 |25.0 |
+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+--------------+---------------+