PySpark: drop leading zero values by group in a dataframe
I have this dataframe:
data = [(0,1,5,5,0,4),
(1,1,5,6,0,7),
(2,1,5,7,1,1),
(3,1,4,8,1,8),
(4,1,5,9,1,1),
(5,1,5,10,1,0),
(6,2,3,4,0,2),
(7,2,3,5,0,6),
(8,2,3,6,3,8),
(9,2,3,7,0,2),
(10,2,3,8,0,6),
(11,2,3,9,6,1)
]
data_cols = ["id","item","store","week","sales","inventory"]
data_df = spark.createDataFrame(data=data, schema=data_cols)
display(data_df)
What I want is to group this by item, store, and week, and then drop all the rows that have leading 0s in sales for each group, like this:
data_new = [(2,1,5,7,1,1),
(3,1,4,8,1,8),
(4,1,5,9,1,1),
(5,1,5,10,1,0),
(8,2,3,6,3,8),
(9,2,3,7,0,2),
(10,2,3,8,0,6),
(11,2,3,9,6,1)
]
dep_cols = ["id","item","store","week","sales","inventory"]
data_df_new = spark.createDataFrame(data=data_new, schema = dep_cols)
display(data_df_new)
I need to do this in PySpark, and I'm new to it. Please help!
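Use a window function ordered by week, and compute either a running sum or a collect_list over it.
1. Filter where the running sum is greater than 0
or
2. Filter the lists that contain any value greater than 0. I prefer the sum because it is faster.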
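from pyspark.sql import Window, functions as F
# Running sum of sales within each (item, store) group, ordered by week
w = Window.partitionBy('item', 'store').orderBy(F.asc('week')).rowsBetween(Window.unboundedPreceding, Window.currentRow)
df.withColumn('sums', F.sum('sales').over(w)).filter(F.col('sums') > 0).drop('sums').show()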
+---+----+-----+----+-----+---+
| id|item|store|week|sales|inv|
+---+----+-----+----+-----+---+
|  2|   1|    5|   7|    1|  1|
|  3|   1|    5|   8|    1|  8|
|  4|   1|    5|   9|    1|  1|
|  5|   1|    5|  10|    1|  0|
|  8|   2|    3|   6|    3|  8|
|  9|   2|    3|   7|    0|  2|
| 10|   2|    3|   8|    0|  6|
| 11|   2|    3|   9|    6|  1|
+---+----+-----+----+-----+---+
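To see why filtering on the running sum removes only the leading zeros, here is a minimal plain-Python sketch of the same logic, run on the rows from the question. It is only an illustration of the idea, not Spark code; the helper name `drop_leading_zero_sales` is made up for this example, and it assumes sales are never negative.

```python
from itertools import accumulate, groupby
from operator import itemgetter

# Rows copied from the question: (id, item, store, week, sales, inventory)
rows = [
    (0, 1, 5, 5, 0, 4), (1, 1, 5, 6, 0, 7), (2, 1, 5, 7, 1, 1),
    (3, 1, 4, 8, 1, 8), (4, 1, 5, 9, 1, 1), (5, 1, 5, 10, 1, 0),
    (6, 2, 3, 4, 0, 2), (7, 2, 3, 5, 0, 6), (8, 2, 3, 6, 3, 8),
    (9, 2, 3, 7, 0, 2), (10, 2, 3, 8, 0, 6), (11, 2, 3, 9, 6, 1),
]


def drop_leading_zero_sales(rows):
    """Hypothetical helper: keep a row once the running sum of sales
    in its (item, store) group, ordered by week, is greater than 0."""
    group_key = itemgetter(1, 2)      # (item, store), like partitionBy
    order_key = itemgetter(1, 2, 3)   # group first, then week, like orderBy
    kept = []
    for _, group in groupby(sorted(rows, key=order_key), key=group_key):
        group = list(group)
        running = accumulate(r[4] for r in group)  # cumulative sales so far
        kept.extend(r for r, total in zip(group, running) if total > 0)
    return sorted(kept, key=itemgetter(0))         # restore id order


kept_ids = [r[0] for r in drop_leading_zero_sales(rows)]
print(kept_ids)  # [2, 3, 4, 5, 8, 9, 10, 11]
```

Once a group has seen its first non-zero sale, the running sum stays above 0, so later zeros (ids 9 and 10) are kept. In the question's data, id 3 has store 4, so it forms a group of its own here and is kept because its sales value is 1. If sales could be negative, the running sum could fall back to 0 or below, and a different running flag (for example, "any non-zero value so far") would be safer.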