Parse a list of JSON strings in a dataframe by converting the column into rows using pyspark?
# Input
df = [
    ('{"id":10, "number" : ["1.1", "1.2", "1.3"]}',),
    ('{"id":20, "number" : ["2.1", "2.2", "2.3"]}',),
]

# Desired output as a dataframe
id | number
---|-------
10 | 1.1
10 | 1.2
10 | 1.3
20 | 2.1
20 | 2.2
20 | 2.3
I tried withColumn, but it only splits the JSON into two columns:
df.withColumn("n",from_json(col("_1"),Sch)).select("n.*")
How can I split the second column into rows, repeating the first column for each number, in pyspark?
Any help would be appreciated! TIA!
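For reference, a minimal sketch of what that attempt yields, assuming Sch is the schema defined in the answer below:

df.withColumn("n", from_json(col("_1"), Sch)).select("n.*").show(truncate=False)
# +---+---------------+
# |id |number         |
# +---+---------------+
# |10 |[1.1, 1.2, 1.3]|
# |20 |[2.1, 2.2, 2.3]|
# +---+---------------+
# number is still an array; the goal is one output row per array element.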
You can use explode here, e.g.:
from pyspark.sql import functions as F
df.withColumn("n",F.from_json(F.col("_1"),Sch))\
.select("n.id",F.explode("n.number").alias("number"))
A complete reproducible example follows:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql import types as T

# Each input row is a one-element tuple holding a JSON string
data = [
    ('{"id":10, "number" : ["1.1", "1.2", "1.3"]}',),
    ('{"id":20, "number" : ["2.1", "2.2", "2.3"]}',),
]

# Schema of the JSON payload
Sch = T.StructType([
    T.StructField("id", T.IntegerType(), True),
    T.StructField("number", T.ArrayType(T.StringType()), True),
])

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(data)
df.show(truncate=False)

# Parse the JSON string, then explode the array so each number gets its own row
df.withColumn("n", F.from_json(F.col("_1"), Sch))\
    .select("n.id", F.explode("n.number").alias("number"))\
    .show(truncate=False)
Output
+-------------------------------------------+
|_1 |
+-------------------------------------------+
|{"id":10, "number" : ["1.1", "1.2", "1.3"]}|
|{"id":20, "number" : ["2.1", "2.2", "2.3"]}|
+-------------------------------------------+
+----+------+
|n.id|number|
+----+------+
|10 |1.1 |
|10 |1.2 |
|10 |1.3 |
|20 |2.1 |
|20 |2.2 |
|20 |2.3 |
+----+------+
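One caveat worth noting: explode drops any row whose array is null or empty. If such rows should be kept (with a null number), pyspark's explode_outer can be swapped in; a minimal variation of the select above:

# Same as before, but rows with a null/empty number array are preserved
df.withColumn("n", F.from_json(F.col("_1"), Sch))\
    .select("n.id", F.explode_outer("n.number").alias("number"))\
    .show(truncate=False)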
Let me know if this works for you.