Create a DataFrame from a JSON file using a JSON schema file kept outside the code

Question

What is the best way in PySpark to create a DataFrame for a JSON file using a separate JSON schema file?

Sample JSON file

{"ORIGIN_COUNTRY_NAME":"Romania","DEST_COUNTRY_NAME":"United States","count":1}
{"ORIGIN_COUNTRY_NAME":"Ireland","DEST_COUNTRY_NAME":"United States","count":264}
{"ORIGIN_COUNTRY_NAME":"India","DEST_COUNTRY_NAME":"United States","count":69}
{"ORIGIN_COUNTRY_NAME":"United States","DEST_COUNTRY_NAME":"Egypt","count":24}

Code to read this file

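# Raw string so the backslashes in the Windows path are not treated as escapes
# (in a normal string, "\201" would be read as an octal escape)
df_json = spark.read.format("json")\
    .option("mode", "FAILFAST")\
    .option("inferschema", "true")\
    .load(r"C:\pyspark\data\2010-summary.json")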

If I don't want to use the "inferschema" option and would rather use a JSON schema file, how can I do that?

JSON schema file

{"$schema": "http://json-schema.org/draft-04/schema#",
  "type": "object",
  "properties": {"ORIGIN_COUNTRY_NAME": {"type": "string"},
                 "DEST_COUNTRY_NAME": {"type": "string"},
                 "count": {"type": "integer"}
                },
  "required": ["ORIGIN_COUNTRY_NAME","DEST_COUNTRY_NAME","count"]
}

Answer 1

Option 1:

I am assuming all of your columns are nullable. Define the schema explicitly and pass it to the reader:

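from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Explicit schema; True marks each column as nullable
yourSchema = StructType([StructField("ORIGIN_COUNTRY_NAME", StringType(), True),
                         StructField("DEST_COUNTRY_NAME", StringType(), True),
                         StructField("count", IntegerType(), True)])

# Pass the schema to the reader instead of inferring it
df_json = spark.read.schema(yourSchema).json(r"C:\pyspark\data\2010-summary.json")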
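
If you would rather not hand-write the schema, one possible approach is to translate the JSON Schema file from the question into the dictionary form that StructType.fromJson accepts. The following is only a sketch under stated assumptions: the helper names (json_schema_to_spark_dict, read_json_with_schema_file) and the type mapping are my own and not part of Spark, and only flat schemas with primitive types are handled.

```python
import json

# Assumed mapping from JSON Schema primitive types to Spark SQL type names.
# This mapping is an illustration, not something Spark provides.
_JSON_SCHEMA_TO_SPARK = {
    "string": "string",
    "integer": "integer",  # use "long" if values may exceed 32 bits
    "number": "double",
    "boolean": "boolean",
}


def json_schema_to_spark_dict(json_schema):
    """Convert a flat JSON Schema object into the dict layout that
    StructType.fromJson accepts. Nested objects and arrays are not handled."""
    required = set(json_schema.get("required", []))
    fields = []
    for name, spec in json_schema["properties"].items():
        fields.append({
            "name": name,
            "type": _JSON_SCHEMA_TO_SPARK[spec["type"]],
            # Fields listed under "required" are marked non-nullable
            "nullable": name not in required,
            "metadata": {},
        })
    return {"type": "struct", "fields": fields}


def read_json_with_schema_file(spark, data_path, schema_path):
    """Hypothetical helper: read data_path using the JSON Schema in schema_path."""
    # Imported lazily so the converter above works without pyspark installed
    from pyspark.sql.types import StructType

    with open(schema_path, encoding="utf-8") as f:
        spark_schema = StructType.fromJson(json_schema_to_spark_dict(json.load(f)))
    return (spark.read.format("json")
            .option("mode", "FAILFAST")
            .schema(spark_schema)
            .load(data_path))
```

You would call it as read_json_with_schema_file(spark, r"C:\pyspark\data\2010-summary.json", schema_path), where schema_path points to wherever you keep the schema file. Note that Spark may still treat the columns as nullable when reading from files, so do not rely on the "required" list being enforced.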

Option 2:

Simply read your file once and inspect the schema Spark infers:

df_json = spark.read.json(r"C:\pyspark\data\2010-summary.json")
df_jsonSchema = df_json.schema

print(type(df_jsonSchema))
# List each StructField (name, type, nullability) in the inferred schema
print([each for each in df_jsonSchema])

Based on the result, you can build the schema as in Option 1.
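
If the goal is simply to keep the schema in a file outside the code, another possibility is to save the inferred schema in Spark's own JSON format and load it back on later runs. This is a minimal sketch: the helper names and the schema file path are hypothetical, while df.schema.json() and StructType.fromJson are existing PySpark calls.

```python
import json


def save_schema(schema_json_text, path):
    """Write the string returned by df.schema.json() to a file."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(schema_json_text)


def load_schema_dict(path):
    """Load a saved Spark schema file as a dict for StructType.fromJson."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def read_with_saved_schema(spark, data_path, schema_path):
    """Hypothetical helper: read data_path with a previously saved schema."""
    # Imported lazily so the file helpers above work without pyspark installed
    from pyspark.sql.types import StructType

    schema = StructType.fromJson(load_schema_dict(schema_path))
    return spark.read.schema(schema).json(data_path)
```

A typical flow would be to run save_schema(df_json.schema.json(), schema_path) once after the read in Option 2, then call read_with_saved_schema(spark, data_path, schema_path) afterwards. The saved file can be edited by hand, for example to change a column type.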

json

apache-spark

apache-spark-sql

pyspark

pyspark-dataframes
