Parse a list of JSON strings in a dataframe by converting the column into rows using pyspark?
# Input
df = [
    ('{"id":10, "number" : ["1.1", "1.2", "1.3"]}',),
    ('{"id":20, "number" : ["2.1", "2.2", "2.3"]}',),
]

# Desired output as a dataframe
id | number
---|-------
10 | 1.1
10 | 1.2
10 | 1.3
20 | 2.1
20 | 2.2
20 | 2.3
I tried withColumn, but it only splits the JSON into two columns:
df.withColumn("n",from_json(col("_1"),Sch)).select("n.*")
How can I split the second column into rows, repeating the first column for each number, in pyspark?
Any help would be appreciated! TIA!
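For reference, a minimal sketch of what that attempt yields, assuming Sch is the schema defined in the answer below:

df.withColumn("n", from_json(col("_1"), Sch)).select("n.*").show(truncate=False)
# +---+---------------+
# |id |number         |
# +---+---------------+
# |10 |[1.1, 1.2, 1.3]|
# |20 |[2.1, 2.2, 2.3]|
# +---+---------------+
# number is still an array; the goal is one output row per array element.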
You can use explode here, e.g.:
from pyspark.sql import functions as F
df.withColumn("n",F.from_json(F.col("_1"),Sch))\
.select("n.id",F.explode("n.number").alias("number"))
A complete reproducible example follows:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql import types as T

# Each input row is a one-element tuple holding a JSON string
data = [
    ('{"id":10, "number" : ["1.1", "1.2", "1.3"]}',),
    ('{"id":20, "number" : ["2.1", "2.2", "2.3"]}',),
]

# Schema of the JSON payload
Sch = T.StructType([
    T.StructField("id", T.IntegerType(), True),
    T.StructField("number", T.ArrayType(T.StringType()), True),
])

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(data)
df.show(truncate=False)

# Parse the JSON string, then explode the array so each number gets its own row
df.withColumn("n", F.from_json(F.col("_1"), Sch))\
    .select("n.id", F.explode("n.number").alias("number"))\
    .show(truncate=False)
Output
+-------------------------------------------+
|_1 |
+-------------------------------------------+
|{"id":10, "number" : ["1.1", "1.2", "1.3"]}|
|{"id":20, "number" : ["2.1", "2.2", "2.3"]}|
+-------------------------------------------+
+----+------+
|n.id|number|
+----+------+
|10 |1.1 |
|10 |1.2 |
|10 |1.3 |
|20 |2.1 |
|20 |2.2 |
|20 |2.3 |
+----+------+
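One caveat worth noting: explode drops any row whose array is null or empty. If such rows should be kept (with a null number), pyspark's explode_outer can be swapped in; a minimal variation of the select above:

# Same as before, but rows with a null/empty number array are preserved
df.withColumn("n", F.from_json(F.col("_1"), Sch))\
    .select("n.id", F.explode_outer("n.number").alias("number"))\
    .show(truncate=False)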
Let me know if this works for you.