Split an array of strings in a DataFrame into their own columns
I have a DataFrame like this:
df.show()
+-----+
|col1 |
+-----+
|[a,b]|
|[c,d]|
+-----+
How can I convert it into the following DataFrame?
+----+----+
|col1|col2|
+----+----+
|   a|   b|
|   c|   d|
+----+----+
It depends on the type of your "list":
- If it is of type ArrayType():
from pyspark.sql import functions as F  # needed for F.struct further down
# `spark` is an existing SparkSession
df = spark.createDataFrame([("a", ["a", "b", "c"]), ("b", ["d", "e", "f"])], ["key", "col"])
df.printSchema()
df.show()
root
 |-- key: string (nullable = true)
 |-- col: array (nullable = true)
 |    |-- element: string (containsNull = true)
+---+---------+
|key|      col|
+---+---------+
|  a|[a, b, c]|
|  b|[d, e, f]|
+---+---------+
You can access the values with [], just as you would in Python:
df.select("key", df.col[0], df.col[1], df.col[2]).show()
+---+------+------+------+
|key|col[0]|col[1]|col[2]|
+---+------+------+------+
|  a|     a|     b|     c|
|  b|     d|     e|     f|
+---+------+------+------+
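When the array has more than a few elements, writing each index by hand gets tedious. Below is a minimal sketch that generates the expressions instead, assuming every row has at least n elements. The helper name array_to_column_exprs and its prefix parameter are hypothetical, not part of PySpark; the resulting strings are meant to be passed to DataFrame.selectExpr.

```python
def array_to_column_exprs(array_col, n, prefix="col"):
    """Build SQL expressions that pull the first n elements of an
    array column into separately named columns (prefix1, prefix2, ...)."""
    return [f"{array_col}[{i}] AS {prefix}{i + 1}" for i in range(n)]


# Hypothetical usage for the three-element array column "col" above
exprs = array_to_column_exprs("col", 3)
print(exprs)
```

With the df defined above, df.selectExpr("key", *exprs).show() should produce the same three value columns as the select above, named col1 to col3 instead of col[0] to col[2].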
- If it is of type StructType() (perhaps you built your DataFrame by reading JSON):
df2 = df.select("key", F.struct(
    df.col[0].alias("col1"),
    df.col[1].alias("col2"),
    df.col[2].alias("col3")
).alias("col"))
df2.printSchema()
df2.show()
root
 |-- key: string (nullable = true)
 |-- col: struct (nullable = false)
 |    |-- col1: string (nullable = true)
 |    |-- col2: string (nullable = true)
 |    |-- col3: string (nullable = true)
+---+---------+
|key|      col|
+---+---------+
|  a|[a, b, c]|
|  b|[d, e, f]|
+---+---------+
You can "split" the column directly using *:
df2.select('key', 'col.*').show()
+---+----+----+----+
|key|col1|col2|col3|
+---+----+----+----+
|  a|   a|   b|   c|
|  b|   d|   e|   f|
+---+----+----+----+
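Applied to the DataFrame from the question, indexing with [] plus alias() gives the requested col1 and col2. The following is only a sketch: split_array_column is a hypothetical helper (not a PySpark function), and it assumes the array holds at least as many elements as there are names supplied.

```python
def split_array_column(df, array_col, names):
    """Return a DataFrame with one column per array element.

    names[i] becomes the name of the column holding element i of array_col.
    """
    source = df[array_col]  # Column object for the array column
    return df.select(*[source[i].alias(name) for i, name in enumerate(names)])
```

For the question's data this would be called as split_array_column(df, "col1", ["col1", "col2"]).show().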