Handle a combination of delimited and non-delimited columns to get a new row for the corresponding value
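I have a specific scenario when creating a PySpark DataFrame: two columns are pipe-delimited, and a third is not. The third column is an aggregate over the other two (pipe-delimited) columns: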
product quantity revenue
a|b 1|1 3 # product 'a' had quantity 1 and 'b' also had quantity 1; total revenue for both products was 3
b|c 3|2 9 # product 'b' had quantity 3 and 'c' had quantity 2; total revenue for all 5 units was 9
Since my product and quantity columns are pipe-delimited, I split each product from its quantity and then explode the result to get one row per product, like this:
product quantity
a 1
b 1
b 3
c 2
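But because revenue has no delimiter, I currently just fill that column with zeros. What I want instead is this (the revenue goes on a separate row whose product is hardcoded as no delimiter in rev):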
product quantity revenue
a 1 0
b 1 0
no delimiter in rev 0 3
b 3 0
c 2 0
no delimiter in rev 0 9
Any insights on how to achieve this would be helpful.
You can union the original DataFrame, with its product column set to no delimiter in rev, with the exploded DataFrame, like this:
from pyspark.sql.functions import split, col, explode, lit, expr
# create a new df with product = 'no delimiter in rev' and quantity = 0
df1 = df.withColumn("product", lit("no delimiter in rev")) \
    .withColumn("quantity", lit(0))
# create another df by exploding the product/quantity pairs, with revenue = 0
# (raw strings keep the regex \| intact without an invalid-escape warning)
df2 = df.withColumn("product", split(col("product"), r"\|")) \
    .withColumn("quantity", split(col("quantity"), r"\|")) \
    .withColumn("product_quantity", explode(expr("arrays_zip(product, quantity)"))) \
    .selectExpr("product_quantity.*", "0 as revenue")
# union the two data frames (union matches columns by position, not by name)
df1.union(df2).show()
#+-------------------+--------+-------+
#|            product|quantity|revenue|
#+-------------------+--------+-------+
#|no delimiter in rev|       0|      3|
#|no delimiter in rev|       0|      9|
#|                  a|       1|      0|
#|                  b|       1|      0|
#|                  b|       3|      0|
#|                  c|       2|      0|
#+-------------------+--------+-------+
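Note that the output above lists the revenue rows first, whereas the question asks for each revenue row to follow the products it belongs to. As a reference for that row-level logic, here is a minimal plain-Python sketch (no Spark involved) that can be used to check expected results on small samples. The function name `expand_rows` and its `label` and `sep` parameters are hypothetical, chosen only for illustration.

```python
def expand_rows(rows, label="no delimiter in rev", sep="|"):
    """Turn (product, quantity, revenue) rows into one row per product,
    followed by a single revenue row for that group.

    `expand_rows`, `label`, and `sep` are illustrative names, not part of
    PySpark or of the original answer.
    """
    out = []
    for product, quantity, revenue in rows:
        products = product.split(sep)
        quantities = quantity.split(sep)
        # zip pairs items by position and stops at the shorter list
        for p, q in zip(products, quantities):
            out.append((p, int(q), 0))
        # the undelimited revenue goes on its own row after the group
        out.append((label, 0, revenue))
    return out


rows = [("a|b", "1|1", 3), ("b|c", "3|2", 9)]
expanded = expand_rows(rows)
for row in expanded:
    print(row)
```

This only illustrates the intended ordering. In Spark, row order after a union is not guaranteed, so reproducing it there would need an explicit sort key.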