SPARK read.json throwing java.io.IOException: Too many bytes before newline

I am getting the following error when reading a large 6 GB single-line JSON file:

Job aborted due to stage failure: Task 5 in stage 0.0 failed 1 times, most recent failure: Lost task 5.0 in stage 0.0 (TID 5, localhost): java.io.IOException: Too many bytes before newline: 2147483648

Spark does not read JSON files that contain newlines, so the entire 6 GB JSON file is on a single line:

jf = sqlContext.read.json("jlrn2.json")

Config:

spark.driver.memory 20g

Yes, you have more than Integer.MAX_VALUE bytes on a single line. You need to split it up.

Keep in mind that Spark expects each line to be a valid JSON document, not the file as a whole. Below is the relevant line from the Spark SQL Programming Guide:

Note that the file that is offered as a json file is not a typical JSON file. Each line must contain a separate, self-contained valid JSON object. As a consequence, a regular multi-line JSON file will most often fail.

So if your JSON document is in the form...

[
  { [record] },
  { [record] }
]

You will need to change it to

{ [record] }
{ [record] }
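
If your file is currently in that array form, here is a minimal sketch of one way to rewrite it as line-delimited records with Python's json module (the filenames are hypothetical, and this assumes the array fits in memory, which a 6 GB file may not):

import json

# hypothetical input file containing one big JSON array of records
with open('records_array.json') as f:
    records = json.load(f)  # a list of dicts

# write one self-contained JSON object per line, as Spark expects
with open('records_lines.json', 'w') as f:
    for record in records:
        f.write(json.dumps(record) + '\n')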

I stumbled upon this while reading a huge JSON file in PySpark and getting the same error. So, in case anyone else is wondering how to save a JSON file in a format that PySpark can read properly, here is a quick example using pandas:

import pandas as pd
from collections import defaultdict

# create some dict you want to dump to JSON
list_of_things_to_dump = [1, 2, 3, 4, 5]
dump_dict = defaultdict(list)
for number in list_of_things_to_dump:
    dump_dict["my_number"].append(number)

# save the data like this using pandas; it will work off the bat with PySpark
output_df = pd.DataFrame.from_dict(dump_dict)
with open('my_fancy_json.json', 'w') as f:
    f.write(output_df.to_json(orient='records', lines=True))
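
The resulting my_fancy_json.json then contains one self-contained JSON object per line, which is exactly the format Spark expects:

{"my_number":1}
{"my_number":2}
{"my_number":3}
{"my_number":4}
{"my_number":5}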

Loading the JSON in PySpark afterwards is as simple as:

df = spark.read.json("hdfs:///user/best_user/my_fancy_json.json", schema=schema)
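
The schema variable is not defined in this answer; as a minimal sketch, assuming the my_fancy_json.json written above, it could be built with pyspark.sql.types like this:

from pyspark.sql.types import StructType, StructField, LongType

# hypothetical schema matching the single my_number column from the pandas example
schema = StructType([StructField("my_number", LongType())])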