PySpark: save JSON, handling nulls for struct columns

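I am using PySpark with Spark 2.4 and Python 3. When writing a DataFrame out as a JSON file, I want a null struct column to be written as {} and a null struct field to be written as "". For example: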

    >>> df.printSchema()
    root
     |-- id: string (nullable = true)
     |-- child1: struct (nullable = true)
     |    |-- f_name: string (nullable = true)
     |    |-- l_name: string (nullable = true)
     |-- child2: struct (nullable = true)
     |    |-- f_name: string (nullable = true)
     |    |-- l_name: string (nullable = true)

    >>> df.show()
    +---+------------+------------+
    | id|      child1|      child2|
    +---+------------+------------+
    |123|[John, Matt]|[Paul, Matt]|
    |111|[Jack, null]|        null|
    |101|        null|        null|
    +---+------------+------------+

    >>> df.fillna("").coalesce(1).write.mode("overwrite").format('json').save('/home/test')

Result:


    {"id":"123","child1":{"f_name":"John","l_name":"Matt"},"child2":{"f_name":"Paul","l_name":"Matt"}}
    {"id":"111","child1":{"f_name":"Jack","l_name":""}}
    {"id":"101"}

Desired output:


    {"id":"123","child1":{"f_name":"John","l_name":"Matt"},"child2":{"f_name":"Paul","l_name":"Matt"}}
    {"id":"111","child1":{"f_name":"Jack","l_name":""},"child2":{}}
    {"id":"101","child1":{},"child2":{}}

I tried a few approaches with map and udf but could not get what I need. Any help is appreciated.

Spark 3.x

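If you pass the ignoreNullFields option with the value False to the writer, null fields are written out as explicit nulls instead of being dropped, and you get the output below. It is not exactly the empty struct you asked for, but the schema is still correct.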

    df.fillna("").coalesce(1).write.mode("overwrite").format('json').option('ignoreNullFields', False).save('/home/test')

    {"child1":{"f_name":"John","l_name":"Matt"},"child2":{"f_name":"Paul","l_name":"Matt"},"id":"123"}
    {"child1":{"f_name":"Jack","l_name":null},"child2":null,"id":"111"}
    {"child1":null,"child2":null,"id":"101"}

Spark 2.x

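Since the option above does not exist in Spark 2.x, I think there is a "dirty fix": mimic the JSON structure and bypass the null check by replacing each null struct with a struct whose fields are all null. Again, the result is not exactly what you asked for, but the schema is correct.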

    from pyspark.sql import functions as F

    # Stand-in for a null struct: a non-null struct whose fields are all null.
    # The explicit string cast keeps its type identical to child1/child2.
    empty_child = F.struct(
        F.lit(None).cast('string').alias('f_name'),
        F.lit(None).cast('string').alias('l_name'),
    )

    (df
        .select(F.struct(
            F.col('id'),
            F.coalesce(F.col('child1'), empty_child).alias('child1'),
            F.coalesce(F.col('child2'), empty_child).alias('child2')
        ).alias('json'))
        .coalesce(1).write.mode("overwrite").format('json').save('/home/test')
    )

    {"json":{"id":"123","child1":{"f_name":"John","l_name":"Matt"},"child2":{"f_name":"Paul","l_name":"Matt"}}}
    {"json":{"id":"111","child1":{"f_name":"Jack"},"child2":{}}}
    {"json":{"id":"101","child1":{},"child2":{}}}
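
Neither variant above gives exactly the output asked for in the question (null struct as {}, null field as ""). If that exact shape matters, one option is to normalize each record in plain Python before serializing it. The following is only a minimal sketch of that idea, not a Spark API: the names STRUCT_FIELDS, normalize_record and to_json_line are made up for illustration, and the field list is assumed to match the schema in the question.

```python
import json

# Hypothetical description of the struct columns and their fields,
# mirroring the schema shown in the question.
STRUCT_FIELDS = {
    "child1": ["f_name", "l_name"],
    "child2": ["f_name", "l_name"],
}


def normalize_record(record, struct_fields=STRUCT_FIELDS):
    """Return a copy where null structs become {} and null struct fields become ""."""
    out = dict(record)  # shallow copy, so the input record is not modified
    for column, fields in struct_fields.items():
        value = record.get(column)
        if value is None:
            # Null struct column -> empty JSON object
            out[column] = {}
        else:
            # Non-null struct -> keep every field, replacing nulls with ""
            out[column] = {
                name: "" if value.get(name) is None else value[name]
                for name in fields
            }
    return out


def to_json_line(record):
    """Serialize one normalized record as a compact JSON line."""
    return json.dumps(normalize_record(record), separators=(",", ":"))


# Sample rows written as plain dicts, matching the data in the question.
rows = [
    {"id": "123", "child1": {"f_name": "John", "l_name": "Matt"},
     "child2": {"f_name": "Paul", "l_name": "Matt"}},
    {"id": "111", "child1": {"f_name": "Jack", "l_name": None}, "child2": None},
    {"id": "101", "child1": None, "child2": None},
]

for row in rows:
    print(to_json_line(row))
```

To use something like this with a DataFrame, one possibility is to convert each row with `Row.asDict(recursive=True)`, map `to_json_line` over `df.rdd`, and write the resulting strings with `saveAsTextFile`. This bypasses the JSON writer entirely, so it is likely slower than the built-in writer and is worth testing on a small sample first.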