Handle a combination of delimited and non-delimited columns to get a new row for the corresponding value

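I have a specific scenario when creating a PySpark DataFrame: two columns are pipe-delimited, and a third is not delimited but is instead an aggregate over the other two (pipe-delimited) columns: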

product quantity revenue
a|b      1|1       3          # product 'a' had quantity 1, 'b' had quantity 1, and the total revenue for both products was 3
b|c      3|2       9          # product 'b' had quantity 3, 'c' had quantity 2, and the total revenue for all 5 units was 9

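Since my product and quantity columns are pipe-delimited, I split each product and its quantity and then exploded them to get the individual quantities, as shown below: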

product   quantity
a            1
b            1
b            3
c            2

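But since revenue is not delimited, for now I am just putting zeros in that column. What I want is something like this (the revenue goes in a separate row whose product is hardcoded as 'no delimiter in rev'):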

product           quantity  revenue
a                    1        0
b                    1        0
no delimiter in rev  0        3
b                    3        0
c                    2        0
no delimiter in rev  0        9
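
To spell out the row-level logic behind this table, here is a minimal plain-Python sketch (not Spark). The function name expand_rows and the in-memory sample are made up for illustration; it only pins down which rows I expect and in what order.

```python
def expand_rows(rows, label="no delimiter in rev"):
    """Expand (product, quantity, revenue) rows into the target layout."""
    out = []
    for product, quantity, revenue in rows:
        products = product.split("|")
        quantities = quantity.split("|")
        # One output row per product/quantity pair, with revenue set to 0.
        for p, q in zip(products, quantities):
            out.append((p, int(q), 0))
        # One extra row carrying the undelimited revenue total.
        out.append((label, 0, revenue))
    return out


sample = [("a|b", "1|1", 3), ("b|c", "3|2", 9)]
for row in expand_rows(sample):
    print(row)
```

Each input row yields its product rows first and its revenue row last, which is the interleaving shown in the table above.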

Any insights on how to achieve this would be helpful.

You can union the original DataFrame, with its product column set to 'no delimiter in rev', with the exploded DataFrame like this:

from pyspark.sql.functions import split, col, explode, lit, expr

# create new df with product = 'no delimiter in rev' and quantity= 0
df1 = df.withColumn("product", lit("no delimiter in rev")) \
    .withColumn("quantity", lit(0))

# create another df by exploding the product/quantity structure and revenue=0
df2 = df.withColumn('product', split(col("product"), r"\|")) \
    .withColumn('quantity', split(col("quantity"), r"\|")) \
    .withColumn("product_quantity", explode(expr("arrays_zip(product, quantity)"))) \
    .selectExpr("product_quantity.*", "0 as revenue")

# union the 2 data frames
df1.union(df2).show()

#+-------------------+--------+-------+
#|            product|quantity|revenue|
#+-------------------+--------+-------+
#|no delimiter in rev|       0|      3|
#|no delimiter in rev|       0|      9|
#|                  a|       1|      0|
#|                  b|       1|      0|
#|                  b|       3|      0|
#|                  c|       2|      0|
#+-------------------+--------+-------+