Handle the combination of delimited and non-delimited columns to get new rows with the corresponding values

Question

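I am creating a PySpark DataFrame for a specific scenario in which two columns are pipe-delimited and a third is not; the third is instead an aggregate over the pipe-delimited values in the other two: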

product  quantity  revenue
a|b      1|1       3        # product 'a' has quantity 1 and 'b' has quantity 1; total revenue for both products is 3
b|c      3|2       9        # product 'b' has quantity 3 and 'c' has quantity 2; total revenue for all 5 units is 9

Because my product and quantity columns are pipe-delimited, I split each product from its quantity and then explode the result to get one row per product, like this:

product   quantity
a            1
b            1
b            3
c            2

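Because revenue is not delimited, I currently just fill that column with zeros. What I want instead is the following, where the revenue goes on an additional row whose product is hard-coded as no delimiter in rev: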

product           quantity  revenue
a                    1        0
b                    1        0
no delimiter in rev  0        3
b                    3        0
c                    2        0
no delimiter in rev  0        9

Any insight into how to achieve this would be helpful.

Answer 1

You can union a copy of the original DataFrame, with its product column set to no delimiter in rev, with the exploded DataFrame, like this:

from pyspark.sql.functions import split, col, explode, lit, expr

# create new df with product = 'no delimiter in rev' and quantity= 0
df1 = df.withColumn("product", lit("no delimiter in rev")) \
    .withColumn("quantity", lit(0))

# create another df by exploding the product/quantity structure and revenue=0
df2 = df.withColumn('product', split(col("product"), r"\|")) \
    .withColumn('quantity', split(col("quantity"), r"\|")) \
    .withColumn("product_quantity", explode(expr("arrays_zip(product, quantity)"))) \
    .selectExpr("product_quantity.*", "0 as revenue")

# union the 2 data frames
df1.union(df2).show()

#+-------------------+--------+-------+
#|            product|quantity|revenue|
#+-------------------+--------+-------+
#|no delimiter in rev|       0|      3|
#|no delimiter in rev|       0|      9|
#|                  a|       1|      0|
#|                  b|       1|      0|
#|                  b|       3|      0|
#|                  c|       2|      0|
#+-------------------+--------+-------+
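
Note that the union above lists all no delimiter in rev rows first, whereas the desired output in the question places each revenue row directly after the product rows it belongs to. As a Spark-free reference for that expected ordering, here is a minimal sketch in plain Python. The function name `expand_rows` and the tuple layout are illustrative assumptions and are not part of the original answer. In Spark, a similar ordering could presumably be obtained by tagging each source row with an id (for example via `monotonically_increasing_id()`) before the union and then sorting on that id plus a flag that puts the revenue row last.

```python
from typing import List, Tuple

# (product, quantity, revenue) as laid out in the question; hypothetical type aliases
InRow = Tuple[str, str, int]
OutRow = Tuple[str, int, int]


def expand_rows(rows: List[InRow], label: str = "no delimiter in rev") -> List[OutRow]:
    """Split pipe-delimited product/quantity pairs and append one revenue row per input row."""
    out: List[OutRow] = []
    for product, quantity, revenue in rows:
        products = product.split("|")
        quantities = quantity.split("|")
        # Pair each product with its quantity; revenue is 0 on these rows.
        for p, q in zip(products, quantities):
            out.append((p, int(q), 0))
        # The undelimited revenue goes on its own labelled row, after its products.
        out.append((label, 0, revenue))
    return out


rows = [("a|b", "1|1", 3), ("b|c", "3|2", 9)]
result = expand_rows(rows)
for row in result:
    print(row)
```

The resulting list can be compared against the collected rows of a sorted Spark result to check that the interleaving matches what the question asks for.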

python-3.x

apache-spark

apache-spark-sql

pyspark

pyspark-dataframes