Spark DDL Schema JSON Struct
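Question
I'm trying to define a nested .json schema in pyspark, but I can't get the ddl_schema string to work.
Usually in SQL this would be ROW. I tried STRUCT below, but I can't get the data type right. This is the error: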
ParseException:
mismatched input '(' expecting {<EOF>, ',', 'COMMENT', NOT}(line 6, pos 15)
== SQL ==
driverId INT,
driverRef STRING,
number STRING,
code STRING,
name STRUCT(forename STRING, surname STRING),
---------------^^^
dob DATE,
nationality STRING,
url STRING
Data sample
+--------+----------+------+----+--------------------+----------+-----------+--------------------+
|driverId| driverRef|number|code| name| dob|nationality| url|
+--------+----------+------+----+--------------------+----------+-----------+--------------------+
| 1| hamilton| 44| HAM| {Lewis, Hamilton}|1985-01-07| British|http://en.wikiped...|
Code sample
mnt = "/mnt/dev/root"
env = "raw"
path = "formula1/drivers"
fileFormat = "json"
inPath = f"{mnt}/{env.upper()}/{path}.{fileFormat}"
options = {'header': 'True'}
ddl_schema = """
driverId INT,
driverRef STRING,
number STRING,
code STRING,
name STRUCT(forename STRING, surname STRING),
dob DATE,
nationality STRING,
url STRING
"""
drivers_df = (spark
    .read
    .options(**options)
    .schema(ddl_schema)
    .format(fileFormat)
    .load(inPath)
)
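Answer
You are using the wrong syntax for STRUCT: the field list goes inside angle brackets, not parentheses, and each field is written as name:type.
This is the correct form:
name STRUCT<forename:STRING,surname:STRING>
See https://spark.apache.org/docs/latest/sql-ref-datatypes.html (search for "Complex types" and select the SQL tab).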
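As a minimal sketch of how the fix fits into the question's code, the schema string below swaps STRUCT(...) for STRUCT<...> and leaves every other column unchanged. The helper name `read_drivers` and its parameters are hypothetical, introduced only for illustration; it assumes `spark` is an existing SparkSession and `in_path` points at the JSON file (`inPath` in the question).

```python
# Corrected DDL schema: nested fields go inside STRUCT<...>, not STRUCT(...).
DDL_SCHEMA = """
    driverId INT,
    driverRef STRING,
    number STRING,
    code STRING,
    name STRUCT<forename: STRING, surname: STRING>,
    dob DATE,
    nationality STRING,
    url STRING
"""


def read_drivers(spark, in_path):
    """Read the drivers JSON file using the explicit DDL schema.

    Hypothetical helper: `spark` is assumed to be an existing SparkSession
    and `in_path` the location of the JSON file.
    """
    # DataFrameReader.schema accepts a DDL-formatted string as well as a StructType.
    return spark.read.schema(DDL_SCHEMA).json(in_path)
```

Called as `read_drivers(spark, inPath)`, this should return a DataFrame whose nested values are reachable as `name.forename` and `name.surname`. The `header` option from the question applies to CSV sources, so the sketch leaves it out.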
Data type | SQL name |
---|---|
BooleanType | BOOLEAN |
ByteType | BYTE, TINYINT |
ShortType | SHORT, SMALLINT |
IntegerType | INT, INTEGER |
LongType | LONG, BIGINT |
FloatType | FLOAT, REAL |
DoubleType | DOUBLE |
DateType | DATE |
TimestampType | TIMESTAMP |
StringType | STRING |
BinaryType | BINARY |
DecimalType | DECIMAL, DEC, NUMERIC |
YearMonthIntervalType | INTERVAL YEAR, INTERVAL YEAR TO MONTH, INTERVAL MONTH |
DayTimeIntervalType | INTERVAL DAY, INTERVAL DAY TO HOUR, INTERVAL DAY TO MINUTE, INTERVAL DAY TO SECOND, INTERVAL HOUR, INTERVAL HOUR TO MINUTE, INTERVAL HOUR TO SECOND, INTERVAL MINUTE, INTERVAL MINUTE TO SECOND, INTERVAL SECOND |
ArrayType | ARRAY<element_type> |
StructType | STRUCT<field1_name: field1_type, field2_name: field2_type, …> Note: ‘:’ is optional. |
MapType | MAP<key_type, value_type> |
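The complex types in the table can be nested inside one another, which is easy to get wrong when writing the string by hand. The sketch below shows one way to generate a DDL string from a plain Python description. `ddl_type` and `ddl_schema` are hypothetical helpers, not part of PySpark, and the dict/list/tuple convention is an assumption made for this example.

```python
def ddl_type(spec):
    """Render a nested Python description as a Spark DDL type string."""
    if isinstance(spec, str):
        # Primitive type name, e.g. "STRING" or "INT".
        return spec
    if isinstance(spec, dict):
        # {"field": type, ...} becomes STRUCT<field: type, ...>
        fields = ", ".join(f"{name}: {ddl_type(t)}" for name, t in spec.items())
        return f"STRUCT<{fields}>"
    if isinstance(spec, list) and len(spec) == 1:
        # [element_type] becomes ARRAY<element_type>
        return f"ARRAY<{ddl_type(spec[0])}>"
    if isinstance(spec, tuple) and len(spec) == 2:
        # (key_type, value_type) becomes MAP<key_type, value_type>
        return f"MAP<{ddl_type(spec[0])}, {ddl_type(spec[1])}>"
    raise TypeError(f"unsupported type description: {spec!r}")


def ddl_schema(columns):
    """Render {"column": type, ...} as a comma-separated DDL schema string."""
    return ", ".join(f"{name} {ddl_type(t)}" for name, t in columns.items())


schema = ddl_schema({
    "driverId": "INT",
    "name": {"forename": "STRING", "surname": "STRING"},
    "dob": "DATE",
})
print(schema)
# driverId INT, name STRUCT<forename: STRING, surname: STRING>, dob DATE
```

The resulting string can be passed to `.schema(...)` in the same way as the handwritten one.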