Spark DDL Schema JSON Struct

Question

I'm trying to define a nested JSON schema in pyspark, but I can't get the ddl_schema string to work.

Normally in SQL this would be ROW. I tried STRUCT below, but I can't get the data type right. This is the error:

ParseException: 
mismatched input '(' expecting {<EOF>, ',', 'COMMENT', NOT}(line 6, pos 15)

== SQL ==

    driverId INT,
    driverRef STRING,
    number STRING,
    code STRING,
    name STRUCT(forename STRING, surname STRING),
---------------^^^
    dob DATE,
    nationality STRING,
    url STRING

Data sample

            +--------+----------+------+----+--------------------+----------+-----------+--------------------+
            |driverId| driverRef|number|code|                name|       dob|nationality|                 url|
            +--------+----------+------+----+--------------------+----------+-----------+--------------------+
            |       1|  hamilton|    44| HAM|   {Lewis, Hamilton}|1985-01-07|    British|http://en.wikiped...|

Code example

        mnt = "/mnt/dev/root"
        env = "raw"
        path = "formula1/drivers"
        fileFormat = "json"
        
        inPath = f"{mnt}/{env.upper()}/{path}.{fileFormat}"
        
        
        options = {'header': 'True'}
        
        ddl_schema = """
            driverId INT,
            driverRef STRING,
            number STRING,
            code STRING,
            name STRUCT(forename STRING, surname STRING),
            dob DATE,
            nationality STRING,
            url STRING
        """
        
        drivers_df = (spark
                       .read
                       .options(**options)
                       .schema(ddl_schema)
                       .format(fileFormat)
                       .load(inPath)
                     )

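Answer

You are using the wrong syntax for STRUCT. The fields go inside angle brackets, not parentheses, and each field name is separated from its type by a colon.
This is the correct form: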

name STRUCT<forename:STRING,surname:STRING>

https://spark.apache.org/docs/latest/sql-ref-datatypes.html
(Search for "Complex types" and select the SQL tab.)
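
Applied to the schema from the question, the fix might look like the sketch below. The column list is the one from the question; the helper name read_drivers and its parameters are made up for illustration. The 'header' option is left out on the assumption that it only matters for CSV sources.

```python
# Corrected DDL schema: STRUCT takes angle brackets and name: type pairs.
ddl_schema = """
    driverId INT,
    driverRef STRING,
    number STRING,
    code STRING,
    name STRUCT<forename: STRING, surname: STRING>,
    dob DATE,
    nationality STRING,
    url STRING
"""


def read_drivers(spark, in_path):
    """Hypothetical helper: read the drivers JSON file with the explicit schema.

    `spark` is an existing SparkSession and `in_path` is the file location,
    e.g. the inPath value built in the question.
    """
    return (
        spark.read
        .schema(ddl_schema)
        .format("json")
        .load(in_path)
    )
```

With a session available, this would be called as drivers_df = read_drivers(spark, inPath), and the nested fields could then be selected as name.forename and name.surname.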

Data type              SQL name
BooleanType            BOOLEAN
ByteType               BYTE, TINYINT
ShortType              SHORT, SMALLINT
IntegerType            INT, INTEGER
LongType               LONG, BIGINT
FloatType              FLOAT, REAL
DoubleType             DOUBLE
DateType               DATE
TimestampType          TIMESTAMP
StringType             STRING
BinaryType             BINARY
DecimalType            DECIMAL, DEC, NUMERIC
YearMonthIntervalType  INTERVAL YEAR, INTERVAL YEAR TO MONTH, INTERVAL MONTH
DayTimeIntervalType    INTERVAL DAY, INTERVAL DAY TO HOUR, INTERVAL DAY TO MINUTE, INTERVAL DAY TO SECOND, INTERVAL HOUR, INTERVAL HOUR TO MINUTE, INTERVAL HOUR TO SECOND, INTERVAL MINUTE, INTERVAL MINUTE TO SECOND, INTERVAL SECOND
ArrayType              ARRAY<element_type>
StructType             STRUCT<field1_name: field1_type, field2_name: field2_type, …> (Note: ':' is optional.)
MapType                MAP<key_type, value_type>
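
The complex types in the table nest inside one another, so a DDL string can be assembled mechanically. The following is a small sketch in plain Python. Note that to_ddl_type and to_ddl_schema are hypothetical helpers, not part of pyspark, and the dict/list/tuple convention used to describe types is an assumption chosen for this example.

```python
def to_ddl_type(spec):
    """Render a nested Python description as a Spark SQL type string.

    Convention assumed for this sketch:
      str                  -> a type name used as-is, e.g. "STRING"
      dict {name: spec}    -> STRUCT<name: type, ...>
      list [spec]          -> ARRAY<element_type>
      tuple (key, value)   -> MAP<key_type, value_type>
    """
    if isinstance(spec, str):
        return spec
    if isinstance(spec, dict):
        fields = ", ".join(f"{name}: {to_ddl_type(t)}" for name, t in spec.items())
        return f"STRUCT<{fields}>"
    if isinstance(spec, list) and len(spec) == 1:
        return f"ARRAY<{to_ddl_type(spec[0])}>"
    if isinstance(spec, tuple) and len(spec) == 2:
        return f"MAP<{to_ddl_type(spec[0])}, {to_ddl_type(spec[1])}>"
    raise TypeError(f"unsupported type description: {spec!r}")


def to_ddl_schema(columns):
    """Join top-level columns as 'name TYPE' entries, one per line."""
    return ",\n".join(f"{name} {to_ddl_type(t)}" for name, t in columns.items())


columns = {
    "driverId": "INT",
    "name": {"forename": "STRING", "surname": "STRING"},
}
print(to_ddl_schema(columns))
# driverId INT,
# name STRUCT<forename: STRING, surname: STRING>
```

The resulting string has the same shape as the ddl_schema passed to .schema() in the question, with the STRUCT column written in the angle-bracket form.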