PySpark, importing schema through JSON file

Question

tbschema.json looks like this:

[{"TICKET":"integer","TRANFERRED":"string","ACCOUNT":"STRING"}]

I load it using the following code:

>>> df2 = sqlContext.jsonFile("tbschema.json")
>>> df2.schema
StructType(List(StructField(ACCOUNT,StringType,true),
    StructField(TICKET,StringType,true),StructField(TRANFERRED,StringType,true)))
>>> df2.printSchema()
root
 |-- ACCOUNT: string (nullable = true)
 |-- TICKET: string (nullable = true)
 |-- TRANFERRED: string (nullable = true)

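Why do the schema elements get sorted, when I want the elements in the same order as they appear in the JSON?
The data type integer has been converted into StringType after the schema was inferred from the JSON. How do I retain the data type?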

Answer 1

Why do the schema elements get sorted, when I want the elements in the same order as they appear in the JSON?

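Because the order of fields is not guaranteed. Although this is not stated explicitly, it becomes obvious when you look at the examples provided in the JSON reader docstring. If you need a specific order, you can provide the schema manually: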

from pyspark.sql.types import StructType, StructField, StringType

schema = StructType([
    StructField("TICKET", StringType(), True),
    StructField("TRANFERRED", StringType(), True),
    StructField("ACCOUNT", StringType(), True),
])
df2 = sqlContext.read.json("tbschema.json", schema)
df2.printSchema()

root
 |-- TICKET: string (nullable = true)
 |-- TRANFERRED: string (nullable = true)
 |-- ACCOUNT: string (nullable = true)

The data type integer has been converted into StringType after the schema was inferred from the JSON. How do I retain the data type?

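The data type of the JSON field TICKET is string (its value in the file is the string "integer"), so the JSON reader returns a string. It is a JSON reader, not some kind of schema reader.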
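
To see why inference yields a string, the same line can be parsed with Python's standard json module. This is a small illustration outside Spark; the variable `line` simply holds the content of tbschema.json as shown in the question.

```python
import json

# Content of tbschema.json as shown in the question
line = '[{"TICKET":"integer","TRANFERRED":"string","ACCOUNT":"STRING"}]'

# The file is a JSON array holding a single object
record = json.loads(line)[0]

# Every value is a JSON string; "integer" is just text, not a type declaration
for name, value in record.items():
    print(name, repr(value), type(value).__name__)
```

A schema-inferring reader that treats this file as data sees three string-valued fields, which matches the printSchema() output in the question.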

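Generally speaking, you should consider a proper format that comes with schema support out of the box, for example Parquet, Avro or Protocol Buffers. But if you really want to play with JSON, you can define a poor man's "schema" parser like this: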

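from collections import OrderedDict
import json

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

with open("./tbschema.json") as fr:
    ds = fr.read()

# Decode into an OrderedDict so the keys keep the order they have in the file
items = (json
  .JSONDecoder(object_pairs_hook=OrderedDict)
  .decode(ds)[0].items())

# Map the type names used in the file to Spark type classes
# ... add further types as needed
mapping = {"string": StringType, "integer": IntegerType}

# Type names are lower-cased, so "STRING" and "string" are treated the same
schema = StructType([
    StructField(k, mapping.get(v.lower())(), True) for (k, v) in items])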

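The problem with JSON is that there is really no guarantee regarding field ordering, not to mention handling of missing fields, inconsistent types and so on. So whether the above solution is appropriate depends on how much you trust your data.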
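
If the schema file cannot be fully trusted, a slightly more defensive variant may help. The sketch below uses only the standard library; the helper name `parse_schema_spec` and the set `KNOWN_TYPES` are assumptions made for illustration and are not part of Spark. It keeps the field order of the file and fails early on unknown or non-string type names.

```python
import json
from collections import OrderedDict

# Hypothetical set of type names this sketch accepts; extend as needed
KNOWN_TYPES = {"string", "integer"}


def parse_schema_spec(text):
    """Return [(field_name, type_name), ...] in file order, or raise ValueError."""
    decoded = json.loads(text, object_pairs_hook=OrderedDict)
    if (not isinstance(decoded, list) or len(decoded) != 1
            or not isinstance(decoded[0], dict)):
        raise ValueError("expected a JSON array holding exactly one object")
    pairs = []
    for name, type_name in decoded[0].items():
        if not isinstance(type_name, str):
            raise ValueError("type of field %r is not a string" % name)
        normalized = type_name.lower()
        if normalized not in KNOWN_TYPES:
            raise ValueError("unknown type %r for field %r" % (type_name, name))
        pairs.append((name, normalized))
    return pairs


spec = parse_schema_spec(
    '[{"TICKET":"integer","TRANFERRED":"string","ACCOUNT":"STRING"}]')
```

The validated pairs in `spec` can then be mapped to StructField objects in the same way as in the parser above.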

Alternatively, you can use the built-in schema import / export utilities (`StructType.json()` and `StructType.fromJson()`).
