Remove struct from Array column in pyspark dataframe
I want to remove one struct from an array of structs (an array column) in a dataframe (pyspark).
import pyspark.sql.functions as F
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
data = [("1", "A", 2), ("1", None, 0), ("1", "B", 3), ("2", None, 0), ("2", "C", 4), ("2", "D", 1), ("2", None, 0)]
dfschema = StructType([StructField("id", StringType()), StructField("value", StringType()), StructField("amount", IntegerType())])
df = spark.createDataFrame(data, schema=dfschema)
grouped = (
    df
    .groupby("id")
    .agg(
        F.collect_list(
            F.struct(F.col("value"), F.col("amount"))
        ).alias("collected")
    )
)
grouped.show(truncate=False)
+---+------------------------------+
|id |collected |
+---+------------------------------+
|1 |[[A, 2], [, 0], [B, 3]] |
|2 |[[, 0], [C, 4], [D, 1], [, 0]]|
+---+------------------------------+
This is the result I want:
+---+-----------------------+
|id |collected |
+---+-----------------------+
|1 |[[A, 2], [B, 3]] |
|2 |[[C, 4], [D, 1]] |
+---+-----------------------+
I tried F.array_remove(..., [, 0]) but got an error. I'm not sure how to define the element that should be removed. Thanks!
For Spark 2.4+, you can use array_except:
from pyspark.sql.functions import array, array_except, col, lit, struct

grouped.withColumn(
    "collected",
    array_except(
        col("collected"),
        # the element to remove: a struct matching (null value, 0 amount)
        array(struct(lit(None).cast("string").alias("value"),
                     lit(0).alias("amount")))
    )
).show()
which gives:
+---+----------------+
|id |collected |
+---+----------------+
|1 |[[A, 2], [B, 3]]|
|2 |[[C, 4], [D, 1]]|
+---+----------------+