PySpark Dataframe 将列融为行
PySpark Dataframe melt columns into rows
如主题所述,我有一个 PySpark Dataframe,我需要将三列合并成行。每列本质上代表一个类别中的一个事实。最终目标是将数据汇总为每个类别的单个总数。
这个数据帧中有数千万行,所以我需要一种方法来在 spark 集群上进行转换,而不会将任何数据返回给驱动程序(在本例中为 Jupyter)。
这是我的几家商店的数据框的摘录:
+-----------+----------------+-----------------+----------------+
| store_id |qty_on_hand_milk|qty_on_hand_bread|qty_on_hand_eggs|
+-----------+----------------+-----------------+----------------+
| 100| 30| 105| 35|
| 200| 55| 85| 65|
| 300| 20| 125| 90|
+-----------+----------------+-----------------+----------------+
这是所需的结果数据框,每个商店多行,其中原始数据框的列已融合到新数据框的行中,新类别列中每个原始列一行:
+-----------+--------+-----------+
| product_id|CATEGORY|qty_on_hand|
+-----------+--------+-----------+
| 100| milk| 30|
| 100| bread| 105|
| 100| eggs| 35|
| 200| milk| 55|
| 200| bread| 85|
| 200| eggs| 65|
| 300| milk| 20|
| 300| bread| 125|
| 300| eggs| 90|
+-----------+--------+-----------+
最终,我想聚合生成的数据框以获得每个类别的总数:
+--------+-----------------+
|CATEGORY|total_qty_on_hand|
+--------+-----------------+
| milk| 105|
| bread| 315|
| eggs| 190|
+--------+-----------------+
更新:
有提示说这个问题是重复的,可以回答。事实并非如此,因为解决方案将行转换为列,而我需要进行相反的操作,将列融合为行。
一种可能的方法是使用 - col,when, functions
pyspark 的模块
>>> from pyspark.sql import functions as F
>>> from pyspark.sql.functions import *
>>> from pyspark.sql.types import StringType
>>> concat_udf = F.udf(lambda cols: "".join([str(x) if x is not None else "*" for x in cols]), StringType())
>>> rdd = sc.parallelize([[100,30,105,35],[200,55,85,65],[300,20,125,90]])
>>> df = rdd.toDF(['store_id','qty_on_hand_milk','qty_on_hand_bread','qty_on_hand_eggs'])
>>> df.show()
+--------+----------------+-----------------+----------------+
|store_id|qty_on_hand_milk|qty_on_hand_bread|qty_on_hand_eggs|
+--------+----------------+-----------------+----------------+
| 100| 30| 105| 35|
| 200| 55| 85| 65|
| 300| 20| 125| 90|
+--------+----------------+-----------------+----------------+
#adding one more column with arrayed values of all three columns
>>> df_1=df.withColumn("new_col", concat_udf(F.array("qty_on_hand_milk", "qty_on_hand_bread","qty_on_hand_eggs")))
#convert it into array<int> for carrying out agg operations
>>> df_2=df_1.withColumn("new_col_1",split(col("new_col"), ",\s*").cast("array<int>").alias("new_col_1"))
#posexplode gives you the position along with usual explode which helps in categorizing
>>> df_3=df_2.select("store_id", posexplode("new_col_1").alias("col_1","qty"))
#if else conditioning for category column
>>> df_3.withColumn("category",F.when(col("col_1") == 0, "milk").when(col("col_1") == 1, "bread").otherwise("eggs")).select("store_id","category","qty").show()
+--------+--------+---+
|store_id|category|qty|
+--------+--------+---+
| 100| milk| 30|
| 100| bread|105|
| 100| eggs| 35|
| 200| milk| 55|
| 200| bread| 85|
| 200| eggs| 65|
| 300| milk| 20|
| 300| bread|125|
| 300| eggs| 90|
+--------+--------+---+
#aggregating to find sum
>>> df_3.withColumn("category",F.when(col("col_1") == 0, "milk").when(col("col_1") == 1, "bread").otherwise("eggs")).select("category","qty").groupBy('category').sum().show()
+--------+--------+
|category|sum(qty)|
+--------+--------+
| eggs| 190|
| bread| 315|
| milk| 105|
+--------+--------+
>>> df_3.printSchema()
root
|-- store_id: long (nullable = true)
|-- col_1: integer (nullable = false)
|-- qty: integer (nullable = true)
我们可以使用explode()函数来解决这个问题。在 Python 中,同样的事情可以用 melt
来完成
# Loading the requisite packages
from pyspark.sql.functions import col, explode, array, struct, expr, sum, lit
# Creating the DataFrame
df = sqlContext.createDataFrame([(100,30,105,35),(200,55,85,65),(300,20,125,90)],('store_id','qty_on_hand_milk','qty_on_hand_bread','qty_on_hand_eggs'))
df.show()
+--------+----------------+-----------------+----------------+
|store_id|qty_on_hand_milk|qty_on_hand_bread|qty_on_hand_eggs|
+--------+----------------+-----------------+----------------+
| 100| 30| 105| 35|
| 200| 55| 85| 65|
| 300| 20| 125| 90|
+--------+----------------+-----------------+----------------+
编写下面的函数,它将explode
这个DataFrame:
def to_explode(df, by):
# Filter dtypes and split into column names and type description
cols, dtypes = zip(*((c, t) for (c, t) in df.dtypes if c not in by))
# Spark SQL supports only homogeneous columns
assert len(set(dtypes)) == 1, "All columns have to be of the same type"
# Create and explode an array of (column_name, column_value) structs
kvs = explode(array([
struct(lit(c).alias("CATEGORY"), col(c).alias("qty_on_hand")) for c in cols
])).alias("kvs")
return df.select(by + [kvs]).select(by + ["kvs.CATEGORY", "kvs.qty_on_hand"])
正在将此 DataFrame 上的函数应用到 explode
it-
df = to_explode(df, ['store_id'])\
.drop('store_id')
df.show()
+-----------------+-----------+
| CATEGORY|qty_on_hand|
+-----------------+-----------+
| qty_on_hand_milk| 30|
|qty_on_hand_bread| 105|
| qty_on_hand_eggs| 35|
| qty_on_hand_milk| 55|
|qty_on_hand_bread| 85|
| qty_on_hand_eggs| 65|
| qty_on_hand_milk| 20|
|qty_on_hand_bread| 125|
| qty_on_hand_eggs| 90|
+-----------------+-----------+
现在,我们需要从 CATEGORY
列中删除字符串 qty_on_hand_
。可以使用 expr() 函数来完成。注意 expr
遵循基于 1 的子字符串索引,而不是 0 -
df = df.withColumn('CATEGORY',expr('substring(CATEGORY, 13)'))
df.show()
+--------+-----------+
|CATEGORY|qty_on_hand|
+--------+-----------+
| milk| 30|
| bread| 105|
| eggs| 35|
| milk| 55|
| bread| 85|
| eggs| 65|
| milk| 20|
| bread| 125|
| eggs| 90|
+--------+-----------+
最后,使用 agg() 函数 -
聚合按 CATEGORY
分组的列 qty_on_hand
df = df.groupBy(['CATEGORY']).agg(sum('qty_on_hand').alias('total_qty_on_hand'))
df.show()
+--------+-----------------+
|CATEGORY|total_qty_on_hand|
+--------+-----------------+
| eggs| 190|
| bread| 315|
| milk| 105|
+--------+-----------------+
我认为您应该使用 array
和 explode
来执行此操作,您不需要任何带有 UDF 或自定义函数的复杂逻辑。
array
将列合并为一列,或注释列。
explode
将数组列转换为一组行。
您只需:
- 用您的自定义标签对每一列进行注释(例如 'milk')
- 将带标签的列合并为一个 'array' 类型的列
- 分解标签列以生成带标签的行
- 删除不相关的列
df = (
df.withColumn('labels', F.explode( # <-- Split into rows
F.array( # <-- Combine columns
F.array(F.lit('milk'), F.col('qty_on_hand_milk')), # <-- Annotate column
F.array(F.lit('bread'), F.col('qty_on_hand_bread')),
F.array(F.lit('eggs'), F.col('qty_on_hand_eggs')),
)
))
.withColumn('CATEGORY', F.col('labels')[0])
.withColumn('qty_on_hand', F.col('labels')[1])
).select('store_id', 'CATEGORY', 'qty_on_hand')
请注意如何简单地使用 col('foo')[INDEX]
提取数组列的元素;
没有特别需要将它们分成单独的列。
这种方法对不同的数据类型也很稳健,因为它不会尝试对每一行强制使用相同的模式(与使用结构不同)。
例如。如果 'qty_on_hand_bread' 是一个字符串,这仍然有效,结果模式将只是:
root
|-- store_id: long (nullable = false)
|-- CATEGORY: string (nullable = true)
|-- qty_on_hand: string (nullable = true) <-- Picks best schema on the fly
这是相同的代码,一步一步地使这里发生的事情一目了然:
import databricks.koalas as ks
import pyspark.sql.functions as F
# You don't need koalas, it's just less verbose for adhoc dataframes
df = ks.DataFrame({
"store_id": [100, 200, 300],
"qty_on_hand_milk": [30, 55, 20],
"qty_on_hand_bread": [105, 85, 125],
"qty_on_hand_eggs": [35, 65, 90],
}).to_spark()
df.show()
# Annotate each column with your custom label per row. ie. v -> ['label', v]
df = df.withColumn('label1', F.array(F.lit('milk'), F.col('qty_on_hand_milk')))
df = df.withColumn('label2', F.array(F.lit('bread'), F.col('qty_on_hand_bread')))
df = df.withColumn('label3', F.array(F.lit('eggs'), F.col('qty_on_hand_eggs')))
df.show()
# Create a new column which combines the labeled values in a single column
df = df.withColumn('labels', F.array('label1', 'label2', 'label3'))
df.show()
# Split into individual rows
df = df.withColumn('labels', F.explode('labels'))
df.show()
# You can now do whatever you want with your labelled rows, eg. split them into new columns
df = df.withColumn('CATEGORY', F.col('labels')[0])
df = df.withColumn('qty_on_hand', F.col('labels')[1])
df.show()
...以及每一步的输出:
|store_id|qty_on_hand_milk|qty_on_hand_bread|qty_on_hand_eggs|
+--------+----------------+-----------------+----------------+
| 100| 30| 105| 35|
| 200| 55| 85| 65|
| 300| 20| 125| 90|
+--------+----------------+-----------------+----------------+
+--------+----------------+-----------------+----------------+----------+------------+----------+
|store_id|qty_on_hand_milk|qty_on_hand_bread|qty_on_hand_eggs| label1| label2| label3|
+--------+----------------+-----------------+----------------+----------+------------+----------+
| 100| 30| 105| 35|[milk, 30]|[bread, 105]|[eggs, 35]|
| 200| 55| 85| 65|[milk, 55]| [bread, 85]|[eggs, 65]|
| 300| 20| 125| 90|[milk, 20]|[bread, 125]|[eggs, 90]|
+--------+----------------+-----------------+----------------+----------+------------+----------+
+--------+----------------+-----------------+----------------+----------+------------+----------+--------------------+
|store_id|qty_on_hand_milk|qty_on_hand_bread|qty_on_hand_eggs| label1| label2| label3| labels|
+--------+----------------+-----------------+----------------+----------+------------+----------+--------------------+
| 100| 30| 105| 35|[milk, 30]|[bread, 105]|[eggs, 35]|[[milk, 30], [bre...|
| 200| 55| 85| 65|[milk, 55]| [bread, 85]|[eggs, 65]|[[milk, 55], [bre...|
| 300| 20| 125| 90|[milk, 20]|[bread, 125]|[eggs, 90]|[[milk, 20], [bre...|
+--------+----------------+-----------------+----------------+----------+------------+----------+--------------------+
+--------+----------------+-----------------+----------------+----------+------------+----------+------------+
|store_id|qty_on_hand_milk|qty_on_hand_bread|qty_on_hand_eggs| label1| label2| label3| labels|
+--------+----------------+-----------------+----------------+----------+------------+----------+------------+
| 100| 30| 105| 35|[milk, 30]|[bread, 105]|[eggs, 35]| [milk, 30]|
| 100| 30| 105| 35|[milk, 30]|[bread, 105]|[eggs, 35]|[bread, 105]|
| 100| 30| 105| 35|[milk, 30]|[bread, 105]|[eggs, 35]| [eggs, 35]|
| 200| 55| 85| 65|[milk, 55]| [bread, 85]|[eggs, 65]| [milk, 55]|
| 200| 55| 85| 65|[milk, 55]| [bread, 85]|[eggs, 65]| [bread, 85]|
| 200| 55| 85| 65|[milk, 55]| [bread, 85]|[eggs, 65]| [eggs, 65]|
| 300| 20| 125| 90|[milk, 20]|[bread, 125]|[eggs, 90]| [milk, 20]|
| 300| 20| 125| 90|[milk, 20]|[bread, 125]|[eggs, 90]|[bread, 125]|
| 300| 20| 125| 90|[milk, 20]|[bread, 125]|[eggs, 90]| [eggs, 90]|
+--------+----------------+-----------------+----------------+----------+------------+----------+------------+
+--------+----------------+-----------------+----------------+----------+------------+----------+------------+--------+-----------+
|store_id|qty_on_hand_milk|qty_on_hand_bread|qty_on_hand_eggs| label1| label2| label3| labels|CATEGORY|qty_on_hand|
+--------+----------------+-----------------+----------------+----------+------------+----------+------------+--------+-----------+
| 100| 30| 105| 35|[milk, 30]|[bread, 105]|[eggs, 35]| [milk, 30]| milk| 30|
| 100| 30| 105| 35|[milk, 30]|[bread, 105]|[eggs, 35]|[bread, 105]| bread| 105|
| 100| 30| 105| 35|[milk, 30]|[bread, 105]|[eggs, 35]| [eggs, 35]| eggs| 35|
| 200| 55| 85| 65|[milk, 55]| [bread, 85]|[eggs, 65]| [milk, 55]| milk| 55|
| 200| 55| 85| 65|[milk, 55]| [bread, 85]|[eggs, 65]| [bread, 85]| bread| 85|
| 200| 55| 85| 65|[milk, 55]| [bread, 85]|[eggs, 65]| [eggs, 65]| eggs| 65|
| 300| 20| 125| 90|[milk, 20]|[bread, 125]|[eggs, 90]| [milk, 20]| milk| 20|
| 300| 20| 125| 90|[milk, 20]|[bread, 125]|[eggs, 90]|[bread, 125]| bread| 125|
| 300| 20| 125| 90|[milk, 20]|[bread, 125]|[eggs, 90]| [eggs, 90]| eggs| 90|
+--------+----------------+-----------------+----------------+----------+------------+----------+------------+--------+-----------+
+--------+--------+-----------+
|store_id|CATEGORY|qty_on_hand|
+--------+--------+-----------+
| 100| milk| 30|
| 100| bread| 105|
| 100| eggs| 35|
| 200| milk| 55|
| 200| bread| 85|
| 200| eggs| 65|
| 300| milk| 20|
| 300| bread| 125|
| 300| eggs| 90|
+--------+--------+-----------+
这是一个实现它的函数
def melt(df,cols,alias=('key','value')):
other = [col for col in df.columns if col not in cols]
for c in cols:
df = df.withColumn(c, F.expr(f'map("{c}", cast({c} as double))'))
df = df.withColumn('melted_cols', F.map_concat(*cols))
return df.select(*other,F.explode('melted_cols').alias(*alias))
如主题所述,我有一个 PySpark Dataframe,我需要将三列合并成行。每列本质上代表一个类别中的一个事实。最终目标是将数据汇总为每个类别的单个总数。
这个数据帧中有数千万行,所以我需要一种方法来在 spark 集群上进行转换,而不会将任何数据返回给驱动程序(在本例中为 Jupyter)。
这是我的几家商店的数据框的摘录:
+-----------+----------------+-----------------+----------------+
| store_id |qty_on_hand_milk|qty_on_hand_bread|qty_on_hand_eggs|
+-----------+----------------+-----------------+----------------+
| 100| 30| 105| 35|
| 200| 55| 85| 65|
| 300| 20| 125| 90|
+-----------+----------------+-----------------+----------------+
这是所需的结果数据框,每个商店多行,其中原始数据框的列已融合到新数据框的行中,新类别列中每个原始列一行:
+-----------+--------+-----------+
| product_id|CATEGORY|qty_on_hand|
+-----------+--------+-----------+
| 100| milk| 30|
| 100| bread| 105|
| 100| eggs| 35|
| 200| milk| 55|
| 200| bread| 85|
| 200| eggs| 65|
| 300| milk| 20|
| 300| bread| 125|
| 300| eggs| 90|
+-----------+--------+-----------+
最终,我想聚合生成的数据框以获得每个类别的总数:
+--------+-----------------+
|CATEGORY|total_qty_on_hand|
+--------+-----------------+
| milk| 105|
| bread| 315|
| eggs| 190|
+--------+-----------------+
更新:
有提示说这个问题是重复的,可以回答
一种可能的方法是使用 - col,when, functions
pyspark 的模块
>>> from pyspark.sql import functions as F
>>> from pyspark.sql.functions import *
>>> from pyspark.sql.types import StringType
>>> concat_udf = F.udf(lambda cols: "".join([str(x) if x is not None else "*" for x in cols]), StringType())
>>> rdd = sc.parallelize([[100,30,105,35],[200,55,85,65],[300,20,125,90]])
>>> df = rdd.toDF(['store_id','qty_on_hand_milk','qty_on_hand_bread','qty_on_hand_eggs'])
>>> df.show()
+--------+----------------+-----------------+----------------+
|store_id|qty_on_hand_milk|qty_on_hand_bread|qty_on_hand_eggs|
+--------+----------------+-----------------+----------------+
| 100| 30| 105| 35|
| 200| 55| 85| 65|
| 300| 20| 125| 90|
+--------+----------------+-----------------+----------------+
#adding one more column with arrayed values of all three columns
>>> df_1=df.withColumn("new_col", concat_udf(F.array("qty_on_hand_milk", "qty_on_hand_bread","qty_on_hand_eggs")))
#convert it into array<int> for carrying out agg operations
>>> df_2=df_1.withColumn("new_col_1",split(col("new_col"), ",\s*").cast("array<int>").alias("new_col_1"))
#posexplode gives you the position along with usual explode which helps in categorizing
>>> df_3=df_2.select("store_id", posexplode("new_col_1").alias("col_1","qty"))
#if else conditioning for category column
>>> df_3.withColumn("category",F.when(col("col_1") == 0, "milk").when(col("col_1") == 1, "bread").otherwise("eggs")).select("store_id","category","qty").show()
+--------+--------+---+
|store_id|category|qty|
+--------+--------+---+
| 100| milk| 30|
| 100| bread|105|
| 100| eggs| 35|
| 200| milk| 55|
| 200| bread| 85|
| 200| eggs| 65|
| 300| milk| 20|
| 300| bread|125|
| 300| eggs| 90|
+--------+--------+---+
#aggregating to find sum
>>> df_3.withColumn("category",F.when(col("col_1") == 0, "milk").when(col("col_1") == 1, "bread").otherwise("eggs")).select("category","qty").groupBy('category').sum().show()
+--------+--------+
|category|sum(qty)|
+--------+--------+
| eggs| 190|
| bread| 315|
| milk| 105|
+--------+--------+
>>> df_3.printSchema()
root
|-- store_id: long (nullable = true)
|-- col_1: integer (nullable = false)
|-- qty: integer (nullable = true)
我们可以使用explode()函数来解决这个问题。在 Python 中,同样的事情可以用 melt
# Loading the requisite packages
from pyspark.sql.functions import col, explode, array, struct, expr, sum, lit
# Creating the DataFrame
df = sqlContext.createDataFrame([(100,30,105,35),(200,55,85,65),(300,20,125,90)],('store_id','qty_on_hand_milk','qty_on_hand_bread','qty_on_hand_eggs'))
df.show()
+--------+----------------+-----------------+----------------+
|store_id|qty_on_hand_milk|qty_on_hand_bread|qty_on_hand_eggs|
+--------+----------------+-----------------+----------------+
| 100| 30| 105| 35|
| 200| 55| 85| 65|
| 300| 20| 125| 90|
+--------+----------------+-----------------+----------------+
编写下面的函数,它将explode
这个DataFrame:
def to_explode(df, by):
# Filter dtypes and split into column names and type description
cols, dtypes = zip(*((c, t) for (c, t) in df.dtypes if c not in by))
# Spark SQL supports only homogeneous columns
assert len(set(dtypes)) == 1, "All columns have to be of the same type"
# Create and explode an array of (column_name, column_value) structs
kvs = explode(array([
struct(lit(c).alias("CATEGORY"), col(c).alias("qty_on_hand")) for c in cols
])).alias("kvs")
return df.select(by + [kvs]).select(by + ["kvs.CATEGORY", "kvs.qty_on_hand"])
正在将此 DataFrame 上的函数应用到 explode
it-
df = to_explode(df, ['store_id'])\
.drop('store_id')
df.show()
+-----------------+-----------+
| CATEGORY|qty_on_hand|
+-----------------+-----------+
| qty_on_hand_milk| 30|
|qty_on_hand_bread| 105|
| qty_on_hand_eggs| 35|
| qty_on_hand_milk| 55|
|qty_on_hand_bread| 85|
| qty_on_hand_eggs| 65|
| qty_on_hand_milk| 20|
|qty_on_hand_bread| 125|
| qty_on_hand_eggs| 90|
+-----------------+-----------+
现在,我们需要从 CATEGORY
列中删除字符串 qty_on_hand_
。可以使用 expr() 函数来完成。注意 expr
遵循基于 1 的子字符串索引,而不是 0 -
df = df.withColumn('CATEGORY',expr('substring(CATEGORY, 13)'))
df.show()
+--------+-----------+
|CATEGORY|qty_on_hand|
+--------+-----------+
| milk| 30|
| bread| 105|
| eggs| 35|
| milk| 55|
| bread| 85|
| eggs| 65|
| milk| 20|
| bread| 125|
| eggs| 90|
+--------+-----------+
最后,使用 agg() 函数 -
聚合按CATEGORY
分组的列 qty_on_hand
df = df.groupBy(['CATEGORY']).agg(sum('qty_on_hand').alias('total_qty_on_hand'))
df.show()
+--------+-----------------+
|CATEGORY|total_qty_on_hand|
+--------+-----------------+
| eggs| 190|
| bread| 315|
| milk| 105|
+--------+-----------------+
我认为您应该使用 array
和 explode
来执行此操作,您不需要任何带有 UDF 或自定义函数的复杂逻辑。
array
将列合并为一列,或注释列。
explode
将数组列转换为一组行。
您只需:
- 用您的自定义标签对每一列进行注释(例如 'milk')
- 将带标签的列合并为一个 'array' 类型的列
- 分解标签列以生成带标签的行
- 删除不相关的列
df = (
df.withColumn('labels', F.explode( # <-- Split into rows
F.array( # <-- Combine columns
F.array(F.lit('milk'), F.col('qty_on_hand_milk')), # <-- Annotate column
F.array(F.lit('bread'), F.col('qty_on_hand_bread')),
F.array(F.lit('eggs'), F.col('qty_on_hand_eggs')),
)
))
.withColumn('CATEGORY', F.col('labels')[0])
.withColumn('qty_on_hand', F.col('labels')[1])
).select('store_id', 'CATEGORY', 'qty_on_hand')
请注意如何简单地使用 col('foo')[INDEX]
提取数组列的元素;
没有特别需要将它们分成单独的列。
这种方法对不同的数据类型也很稳健,因为它不会尝试对每一行强制使用相同的模式(与使用结构不同)。
例如。如果 'qty_on_hand_bread' 是一个字符串,这仍然有效,结果模式将只是:
root
|-- store_id: long (nullable = false)
|-- CATEGORY: string (nullable = true)
|-- qty_on_hand: string (nullable = true) <-- Picks best schema on the fly
这是相同的代码,一步一步地使这里发生的事情一目了然:
import databricks.koalas as ks
import pyspark.sql.functions as F
# You don't need koalas, it's just less verbose for adhoc dataframes
df = ks.DataFrame({
"store_id": [100, 200, 300],
"qty_on_hand_milk": [30, 55, 20],
"qty_on_hand_bread": [105, 85, 125],
"qty_on_hand_eggs": [35, 65, 90],
}).to_spark()
df.show()
# Annotate each column with your custom label per row. ie. v -> ['label', v]
df = df.withColumn('label1', F.array(F.lit('milk'), F.col('qty_on_hand_milk')))
df = df.withColumn('label2', F.array(F.lit('bread'), F.col('qty_on_hand_bread')))
df = df.withColumn('label3', F.array(F.lit('eggs'), F.col('qty_on_hand_eggs')))
df.show()
# Create a new column which combines the labeled values in a single column
df = df.withColumn('labels', F.array('label1', 'label2', 'label3'))
df.show()
# Split into individual rows
df = df.withColumn('labels', F.explode('labels'))
df.show()
# You can now do whatever you want with your labelled rows, eg. split them into new columns
df = df.withColumn('CATEGORY', F.col('labels')[0])
df = df.withColumn('qty_on_hand', F.col('labels')[1])
df.show()
...以及每一步的输出:
|store_id|qty_on_hand_milk|qty_on_hand_bread|qty_on_hand_eggs|
+--------+----------------+-----------------+----------------+
| 100| 30| 105| 35|
| 200| 55| 85| 65|
| 300| 20| 125| 90|
+--------+----------------+-----------------+----------------+
+--------+----------------+-----------------+----------------+----------+------------+----------+
|store_id|qty_on_hand_milk|qty_on_hand_bread|qty_on_hand_eggs| label1| label2| label3|
+--------+----------------+-----------------+----------------+----------+------------+----------+
| 100| 30| 105| 35|[milk, 30]|[bread, 105]|[eggs, 35]|
| 200| 55| 85| 65|[milk, 55]| [bread, 85]|[eggs, 65]|
| 300| 20| 125| 90|[milk, 20]|[bread, 125]|[eggs, 90]|
+--------+----------------+-----------------+----------------+----------+------------+----------+
+--------+----------------+-----------------+----------------+----------+------------+----------+--------------------+
|store_id|qty_on_hand_milk|qty_on_hand_bread|qty_on_hand_eggs| label1| label2| label3| labels|
+--------+----------------+-----------------+----------------+----------+------------+----------+--------------------+
| 100| 30| 105| 35|[milk, 30]|[bread, 105]|[eggs, 35]|[[milk, 30], [bre...|
| 200| 55| 85| 65|[milk, 55]| [bread, 85]|[eggs, 65]|[[milk, 55], [bre...|
| 300| 20| 125| 90|[milk, 20]|[bread, 125]|[eggs, 90]|[[milk, 20], [bre...|
+--------+----------------+-----------------+----------------+----------+------------+----------+--------------------+
+--------+----------------+-----------------+----------------+----------+------------+----------+------------+
|store_id|qty_on_hand_milk|qty_on_hand_bread|qty_on_hand_eggs| label1| label2| label3| labels|
+--------+----------------+-----------------+----------------+----------+------------+----------+------------+
| 100| 30| 105| 35|[milk, 30]|[bread, 105]|[eggs, 35]| [milk, 30]|
| 100| 30| 105| 35|[milk, 30]|[bread, 105]|[eggs, 35]|[bread, 105]|
| 100| 30| 105| 35|[milk, 30]|[bread, 105]|[eggs, 35]| [eggs, 35]|
| 200| 55| 85| 65|[milk, 55]| [bread, 85]|[eggs, 65]| [milk, 55]|
| 200| 55| 85| 65|[milk, 55]| [bread, 85]|[eggs, 65]| [bread, 85]|
| 200| 55| 85| 65|[milk, 55]| [bread, 85]|[eggs, 65]| [eggs, 65]|
| 300| 20| 125| 90|[milk, 20]|[bread, 125]|[eggs, 90]| [milk, 20]|
| 300| 20| 125| 90|[milk, 20]|[bread, 125]|[eggs, 90]|[bread, 125]|
| 300| 20| 125| 90|[milk, 20]|[bread, 125]|[eggs, 90]| [eggs, 90]|
+--------+----------------+-----------------+----------------+----------+------------+----------+------------+
+--------+----------------+-----------------+----------------+----------+------------+----------+------------+--------+-----------+
|store_id|qty_on_hand_milk|qty_on_hand_bread|qty_on_hand_eggs| label1| label2| label3| labels|CATEGORY|qty_on_hand|
+--------+----------------+-----------------+----------------+----------+------------+----------+------------+--------+-----------+
| 100| 30| 105| 35|[milk, 30]|[bread, 105]|[eggs, 35]| [milk, 30]| milk| 30|
| 100| 30| 105| 35|[milk, 30]|[bread, 105]|[eggs, 35]|[bread, 105]| bread| 105|
| 100| 30| 105| 35|[milk, 30]|[bread, 105]|[eggs, 35]| [eggs, 35]| eggs| 35|
| 200| 55| 85| 65|[milk, 55]| [bread, 85]|[eggs, 65]| [milk, 55]| milk| 55|
| 200| 55| 85| 65|[milk, 55]| [bread, 85]|[eggs, 65]| [bread, 85]| bread| 85|
| 200| 55| 85| 65|[milk, 55]| [bread, 85]|[eggs, 65]| [eggs, 65]| eggs| 65|
| 300| 20| 125| 90|[milk, 20]|[bread, 125]|[eggs, 90]| [milk, 20]| milk| 20|
| 300| 20| 125| 90|[milk, 20]|[bread, 125]|[eggs, 90]|[bread, 125]| bread| 125|
| 300| 20| 125| 90|[milk, 20]|[bread, 125]|[eggs, 90]| [eggs, 90]| eggs| 90|
+--------+----------------+-----------------+----------------+----------+------------+----------+------------+--------+-----------+
+--------+--------+-----------+
|store_id|CATEGORY|qty_on_hand|
+--------+--------+-----------+
| 100| milk| 30|
| 100| bread| 105|
| 100| eggs| 35|
| 200| milk| 55|
| 200| bread| 85|
| 200| eggs| 65|
| 300| milk| 20|
| 300| bread| 125|
| 300| eggs| 90|
+--------+--------+-----------+
这是一个实现它的函数
def melt(df,cols,alias=('key','value')):
other = [col for col in df.columns if col not in cols]
for c in cols:
df = df.withColumn(c, F.expr(f'map("{c}", cast({c} as double))'))
df = df.withColumn('melted_cols', F.map_concat(*cols))
return df.select(*other,F.explode('melted_cols').alias(*alias))