PySpark dataframe transformation by grouping multiple columns independently
Hi, I want to transform a dataframe by grouping over several columns independently and adding the resulting aggregate columns to the existing dataframe.
ID qty year day week
1 2 2016 5 9
2 4 2016 5 9
3 0 2016 5 9
4 4 2016 5 9
5 0 2016 5 9
6 2 2016 5 9
1 0 2016 6 9
2 2 2016 6 9
3 0 2016 6 9
4 0 2016 6 9
5 0 2016 6 9
6 0 2016 6 9
1 0 2016 0 10
2 2 2016 0 10
3 2 2016 0 10
4 2 2016 0 10
5 6 2016 0 10
6 0 2016 0 10
1 0 2016 1 10
2 0 2016 1 10
3 2 2016 1 10
4 0 2016 1 10
5 0 2016 1 10
6 0 2016 1 10
1 0 2016 2 10
2 6 2016 2 10
3 0 2016 2 10
4 4 2016 2 10
5 0 2016 2 10
6 2 2016 2 10
1 0 2016 3 10
2 0 2016 3 10
3 0 2016 3 10
4 4 2016 3 10
5 0 2016 3 10
6 0 2016 3 10
1 0 2016 4 10
2 0 2016 4 10
3 2 2016 4 10
4 4 2016 4 10
5 0 2016 4 10
6 2 2016 4 10
1 4 2016 5 10
2 0 2016 5 10
3 0 2016 5 10
4 8 2016 5 10
5 0 2016 5 10
6 0 2016 5 10
1 0 2016 6 10
2 0 2016 6 10
3 0 2016 6 10
4 6 2016 6 10
5 2 2016 6 10
6 6 2016 6 10
1 0 2020 0 8
2 2 2020 0 8
3 0 2020 0 8
4 0 2020 0 8
5 0 2020 0 8
6 2 2020 0 8
1 0 2020 1 8
2 0 2020 1 8
3 0 2020 1 8
4 0 2020 1 8
5 0 2020 1 8
6 0 2020 1 8
1 0 2020 2 8
2 0 2020 2 8
3 0 2020 2 8
4 0 2020 2 8
5 0 2020 2 8
6 0 2020 2 8
1 0 2020 3 8
2 0 2020 3 8
3 0 2020 3 8
4 0 2020 3 8
5 0 2020 3 8
6 0 2020 3 8
1 0 2020 4 8
2 0 2020 4 8
3 2 2020 4 8
4 0 2020 4 8
5 0 2020 4 8
6 0 2020 4 8
5 2 2020 5 8
6 4 2020 5 8
3 4 2020 6 8
3 4 2020 0 9
I am trying to transform it into this:
ID qty year day week total_Sales_year total_sales_by_day total_sales_week
1 2 2016 5 9 78 12 14
2 4 2016 5 9 78 12 14
3 0 2016 5 9 78 12 14
4 4 2016 5 9 78 12 14
5 0 2016 5 9 78 12 14
6 2 2016 5 9 78 12 14
1 0 2016 6 9 78 2 14
2 2 2016 6 9 78 2 14
3 0 2016 6 9 78 2 14
4 0 2016 6 9 78 2 14
5 0 2016 6 9 78 2 14
6 0 2016 6 9 78 2 14
1 0 2016 0 10 78 12 64
2 2 2016 0 10 78 12 64
3 2 2016 0 10 78 12 64
4 2 2016 0 10 78 12 64
5 6 2016 0 10 78 12 64
6 0 2016 0 10 78 12 64
1 0 2016 1 10 78 2 64
2 0 2016 1 10 78 2 64
3 2 2016 1 10 78 2 64
4 0 2016 1 10 78 2 64
5 0 2016 1 10 78 2 64
6 0 2016 1 10 78 2 64
1 0 2016 2 10 78 12 64
2 6 2016 2 10 78 12 64
3 0 2016 2 10 78 12 64
4 4 2016 2 10 78 12 64
5 0 2016 2 10 78 12 64
6 2 2016 2 10 78 12 64
1 0 2016 3 10 78 4 64
2 0 2016 3 10 78 4 64
3 0 2016 3 10 78 4 64
4 4 2016 3 10 78 4 64
5 0 2016 3 10 78 4 64
6 0 2016 3 10 78 4 64
1 0 2016 4 10 78 8 64
2 0 2016 4 10 78 8 64
3 2 2016 4 10 78 8 64
4 4 2016 4 10 78 8 64
5 0 2016 4 10 78 8 64
6 2 2016 4 10 78 8 64
1 4 2016 5 10 78 12 64
2 0 2016 5 10 78 12 64
3 0 2016 5 10 78 12 64
4 8 2016 5 10 78 12 64
5 0 2016 5 10 78 12 64
6 0 2016 5 10 78 12 64
1 0 2016 6 10 78 14 64
2 0 2016 6 10 78 14 64
3 0 2016 6 10 78 14 64
4 6 2016 6 10 78 14 64
5 2 2016 6 10 78 14 64
6 6 2016 6 10 78 14 64
1 0 2020 0 8 20 4 20
2 2 2020 0 8 20 4 20
3 0 2020 0 8 20 4 20
4 0 2020 0 8 20 4 20
5 0 2020 0 8 20 4 20
6 2 2020 0 8 20 4 20
1 0 2020 1 8 20 0 20
2 0 2020 1 8 20 0 20
3 0 2020 1 8 20 0 20
4 0 2020 1 8 20 0 20
5 0 2020 1 8 20 0 20
6 0 2020 1 8 20 0 20
1 0 2020 2 8 20 0 20
2 0 2020 2 8 20 0 20
3 0 2020 2 8 20 0 20
4 0 2020 2 8 20 0 20
5 0 2020 2 8 20 0 20
6 0 2020 2 8 20 0 20
1 0 2020 3 8 20 0 20
2 0 2020 3 8 20 0 20
3 0 2020 3 8 20 0 20
4 0 2020 3 8 20 0 20
5 0 2020 3 8 20 0 20
6 0 2020 3 8 20 0 20
1 0 2020 4 8 20 2 20
2 0 2020 4 8 20 2 20
3 2 2020 4 8 20 2 20
4 0 2020 4 8 20 2 20
5 0 2020 4 8 20 2 20
6 0 2020 4 8 20 2 20
5 2 2020 5 8 20 6 20
6 4 2020 5 8 20 6 20
3 4 2020 6 8 20 4 20
3 4 2020 0 9 20 4 20
I tried doing this, but it does not produce the expected dataframe:
df.groupBy("year").agg("qty")
But that only yields a single aggregated column, and there is no way to tie it back to the ID column.
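For context, the reason that attempt loses the ID column is that groupBy().agg() collapses each group to a single row. One conventional fix, separate from the window approach in the answer below, is to aggregate and then join the totals back onto the original rows. A minimal sketch, assuming a SparkSession named spark and a handful of illustrative rows in the shape of the data above:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# A few rows in the same shape as the sample data above.
df = spark.createDataFrame(
    [(1, 2, 2016, 5, 9), (2, 4, 2016, 5, 9), (1, 0, 2016, 6, 9)],
    ["ID", "qty", "year", "day", "week"],
)

# Aggregate per year, then join back so every row keeps its ID.
yearly = df.groupBy("year").agg(F.sum("qty").alias("total_Sales_year"))
df = df.join(yearly, on="year", how="left")
df.show()
The same pattern repeats for ("year", "week") and ("year", "week", "day"), at the cost of one join per grouping level; the window-based answer below avoids those extra joins.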
You can use an unbounded Window with only a partitionBy clause, as shown below, or you can specify the frame explicitly with rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing).
I have tested it on your data and it works fine. Let me know if it works for you.
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# One unbounded window per grouping level; with no orderBy, each sum
# covers the entire partition.
w1 = Window.partitionBy("year")
w2 = Window.partitionBy("year", "week")
w3 = Window.partitionBy("year", "week", "day")

df.withColumn("total_Sales_year", F.sum("qty").over(w1))\
  .withColumn("total_sales_by_day", F.sum("qty").over(w3))\
  .withColumn("total_sales_week", F.sum("qty").over(w2)).show()
+---+---+----+---+----+----------------+------------------+----------------+
| ID|qty|year|day|week|total_Sales_year|total_sales_by_day|total_sales_week|
+---+---+----+---+----+----------------+------------------+----------------+
| 1| 2|2016| 5| 9| 78| 12| 14|
| 2| 4|2016| 5| 9| 78| 12| 14|
| 3| 0|2016| 5| 9| 78| 12| 14|
| 4| 4|2016| 5| 9| 78| 12| 14|
| 5| 0|2016| 5| 9| 78| 12| 14|
| 6| 2|2016| 5| 9| 78| 12| 14|
| 1| 0|2016| 6| 9| 78| 2| 14|
| 2| 2|2016| 6| 9| 78| 2| 14|
| 3| 0|2016| 6| 9| 78| 2| 14|
| 4| 0|2016| 6| 9| 78| 2| 14|
| 5| 0|2016| 6| 9| 78| 2| 14|
| 6| 0|2016| 6| 9| 78| 2| 14|
| 1| 0|2016| 0| 10| 78| 12| 64|
| 2| 2|2016| 0| 10| 78| 12| 64|
| 3| 2|2016| 0| 10| 78| 12| 64|
| 4| 2|2016| 0| 10| 78| 12| 64|
| 5| 6|2016| 0| 10| 78| 12| 64|
| 6| 0|2016| 0| 10| 78| 12| 64|
| 1| 0|2016| 1| 10| 78| 2| 64|
| 2| 0|2016| 1| 10| 78| 2| 64|
+---+---+----+---+----+----------------+------------------+----------------+
only showing top 20 rows
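For reference, the unbounded frame mentioned above can also be spelled out explicitly; with only partitionBy and no orderBy, the two forms should be equivalent, so this is just the verbose version of w1:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Same as Window.partitionBy("year"): the frame spans the whole partition.
w1_explicit = Window.partitionBy("year").rowsBetween(
    Window.unboundedPreceding, Window.unboundedFollowing
)
df.withColumn("total_Sales_year", F.sum("qty").over(w1_explicit)).show()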
Also, if you want a total_sales_year_to_day column (a running total within each year, up to and including the current day), you can use this:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w1 = Window.partitionBy("year")
w2 = Window.partitionBy("year", "week")
w3 = Window.partitionBy("year", "week", "day")
# w4 numbers the rows inside each (year, week, day) group so the day total
# is kept exactly once per day; w5 orders the days within each year.
w4 = Window.partitionBy("year", "week", "day").orderBy("day")
w5 = Window.partitionBy("year").orderBy("week", "day")

df.withColumn("total_Sales_year", F.sum("qty").over(w1))\
  .withColumn("total_sales_by_day", F.sum("qty").over(w3))\
  .withColumn("total_sales_week", F.sum("qty").over(w2))\
  .withColumn("rownum", F.row_number().over(w4))\
  .withColumn("newday", F.when(F.col("rownum") != 1, F.lit(0)).otherwise(F.col("total_sales_by_day")))\
  .withColumn("total_sales_year_to_day", F.sum("newday").over(w5))\
  .drop("rownum", "newday").show()
+---+---+----+---+----+----------------+------------------+----------------+-----------------------+
| ID|qty|year|day|week|total_Sales_year|total_sales_by_day|total_sales_week|total_sales_year_to_day|
+---+---+----+---+----+----------------+------------------+----------------+-----------------------+
| 1| 2|2016| 5| 9| 78| 12| 14| 12|
| 2| 4|2016| 5| 9| 78| 12| 14| 12|
| 3| 0|2016| 5| 9| 78| 12| 14| 12|
| 4| 4|2016| 5| 9| 78| 12| 14| 12|
| 5| 0|2016| 5| 9| 78| 12| 14| 12|
| 6| 2|2016| 5| 9| 78| 12| 14| 12|
| 1| 0|2016| 6| 9| 78| 2| 14| 14|
| 2| 2|2016| 6| 9| 78| 2| 14| 14|
| 3| 0|2016| 6| 9| 78| 2| 14| 14|
| 4| 0|2016| 6| 9| 78| 2| 14| 14|
| 5| 0|2016| 6| 9| 78| 2| 14| 14|
| 6| 0|2016| 6| 9| 78| 2| 14| 14|
| 1| 0|2016| 0| 10| 78| 12| 64| 26|
| 2| 2|2016| 0| 10| 78| 12| 64| 26|
| 3| 2|2016| 0| 10| 78| 12| 64| 26|
| 4| 2|2016| 0| 10| 78| 12| 64| 26|
| 5| 6|2016| 0| 10| 78| 12| 64| 26|
| 6| 0|2016| 0| 10| 78| 12| 64| 26|
| 1| 0|2016| 1| 10| 78| 2| 64| 28|
| 2| 0|2016| 1| 10| 78| 2| 64| 28|
+---+---+----+---+----+----------------+------------------+----------------+-----------------------+
only showing top 20 rows
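As a side note, Spark's default frame for an ordered window is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, which includes every peer row that ties on the ordering keys. Since all rows of a given (week, day) tie under the ordering of w5, a single ordered sum over qty should produce the same total_sales_year_to_day without the row_number/newday bookkeeping. A minimal sketch of that shortcut:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Ties on (week, day) are peers, so the default RANGE frame sums the
# whole current day along with all earlier days in the year.
w_running = Window.partitionBy("year").orderBy("week", "day")
df.withColumn("total_sales_year_to_day", F.sum("qty").over(w_running)).show()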