Pyspark: Create Dataframe - Boolean fields in Map type are parsed as null
I am creating a dataframe from the python list below:
_test = [('val1', {'key1': ['A', 'B'], 'key2': ['C'], 'bool_key1': True, 'bool_key2': True}),
('val2', {'key1': ['B'], 'key2': ['D'], 'bool_key1': False, 'bool_key2': None})]
df_test = spark.createDataFrame(_test, schema = ["col1","col2"])
df_test.show(truncate=False)
However, all the boolean fields in the resulting dataframe are null!
+----+---------------------------------------------------------+
|col1|col2 |
+----+---------------------------------------------------------+
|val1|[key1 -> [A, B], bool_key2 ->, key2 -> [C], bool_key1 ->]|
|val2|[key1 -> [B], bool_key2 ->, key2 -> [D], bool_key1 ->] |
+----+---------------------------------------------------------+
Schema of the df_test dataframe:
root
|-- col1: string (nullable = true)
|-- col2: map (nullable = true)
| |-- key: string
| |-- value: array (valueContainsNull = true)
| | |-- element: string (containsNull = true)
Is there any way I can create the dataframe without changing the structure of the python variable?
Define a schema, and don't use tuples to define the rows; use lists. Try the code below:
_test1 = [["val1",{"key1": ["A", "B"], "key2": ["C"], "bool_key1": True, "bool_key2": True}],
["val1",{"key1": ["A", "B"], "key2": ["C"], "bool_key1": True, "bool_key2": True}],
["val2", {"key1": ["B"], "key2": ["D"], "bool_key1": False, "bool_key2": None}]]
df2 = spark.createDataFrame(_test1, 'col1 string, col2 struct<key1:array<string>,key2:array<string>,bool_key1:boolean,bool_key2:boolean>')
df2.show(truncate=False)
+----+-------------------------+
|col1|col2                     |
+----+-------------------------+
|val1|{[A, B], [C], true, true}|
|val1|{[A, B], [C], true, true}|
|val2|{[B], [D], false, null}  |
+----+-------------------------+
root
|-- col1: string (nullable = true)
|-- col2: struct (nullable = true)
| |-- key1: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- key2: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- bool_key1: boolean (nullable = true)
| |-- bool_key2: boolean (nullable = true)
Adding to @wwnde's answer, here is another way to define the struct schema (though personally I prefer @wwnde's answer, as it takes fewer lines of code) -
Define the struct schema -
from pyspark.sql.types import *

schema = StructType([
    StructField("col1", StringType()),
    StructField("col2", StructType([
        StructField("key1", ArrayType(StringType())),
        StructField("key2", ArrayType(StringType())),
        StructField("bool_key1", BooleanType()),
        StructField("bool_key2", BooleanType())
    ]))
])
Creating the dataframe -
_test = [
('val1', {'key1': ['A', 'B'], 'key2': ['C'], 'bool_key1': True, 'bool_key2': True}),
('val2', {'key1': ['B'], 'key2': ['D'], 'bool_key1': False, 'bool_key2': None})
]
df=spark.createDataFrame(data=_test, schema=schema)
df.printSchema()
Output -
root
|-- col1: string (nullable = true)
|-- col2: struct (nullable = true)
| |-- key1: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- key2: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- bool_key1: boolean (nullable = true)
| |-- bool_key2: boolean (nullable = true)
If you want to keep the MapType key-value pairs intact, try the logic below -
_test = [
('val1', {'key1': ['A', 'B'], 'key2': ['C'], 'bool_key1': True, 'bool_key2': True}),
('val2', {'key1': ['B'], 'key2': ['D'], 'bool_key1': False, 'bool_key2': None})
]
schema = StructType([
    StructField("col1", StringType()),
    StructField("col2", MapType(StringType(), StringType()))
])
df_test = spark.createDataFrame(data=_test, schema=schema)
df_test.show(truncate=False)
+----+-------------------------------------------------------------------+
|col1|col2 |
+----+-------------------------------------------------------------------+
|val1|{key1 -> [A, B], bool_key2 -> true, key2 -> [C], bool_key1 -> true}|
|val2|{key1 -> [B], bool_key2 -> null, key2 -> [D], bool_key1 -> false} |
+----+-------------------------------------------------------------------+
df_test.printSchema()
root
|-- col1: string (nullable = true)
|-- col2: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)