How to parse multi-line JSON into a Dataset in Apache Spark (Java)

Question

Is there a way to parse a multi-line JSON file using Dataset? Here is my sample code:

public static void main(String[] args) {

    // creating spark session
    SparkSession spark = SparkSession.builder().appName("Java Spark SQL basic example")
                .config("spark.some.config.option", "some-value").getOrCreate();

    Dataset<Row> df = spark.read().json("D:/sparktestio/input.json");
    df.show();
}

This works fine when the JSON is on a single line, but I need it to work for JSON that spans multiple lines.

My JSON file:

{
  "name": "superman",
  "age": "unknown",
  "height": "6.2",
  "weight": "flexible"
}

Answer 1

The last time I checked the Spark SQL documentation, this note stood out:

Note that the file that is offered as a json file is not a typical JSON file. Each line must contain a separate, self-contained valid JSON object. As a consequence, a regular multi-line JSON file will most often fail.
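To make the note above concrete, here is a minimal plain-Java sketch (no Spark involved) that checks whether each line of a text is a self-contained JSON object. The class and method names (`JsonLinesCheck`, `isSelfContainedObject`, `offendingLines`) are hypothetical, and the check is only a rough structural test, not a JSON validator. It illustrates why a pretty-printed file like the one in the question is rejected by a line-by-line reader.

```java
import java.util.ArrayList;
import java.util.List;

final class JsonLinesCheck {

    // Returns true when the line starts with '{', ends with '}', and its braces and
    // brackets balance outside string literals. Rough structural check only.
    static boolean isSelfContainedObject(String line) {
        String s = line.trim();
        if (!s.startsWith("{") || !s.endsWith("}")) {
            return false;
        }
        int depth = 0;
        boolean inString = false;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (inString) {
                if (c == '\\') {
                    i++;                       // skip the escaped character
                } else if (c == '"') {
                    inString = false;
                }
            } else if (c == '"') {
                inString = true;
            } else if (c == '{' || c == '[') {
                depth++;
            } else if (c == '}' || c == ']') {
                depth--;
                if (depth < 0) {
                    return false;
                }
                // The top-level object closed before the end of the line.
                if (depth == 0 && i < s.length() - 1) {
                    return false;
                }
            }
        }
        return depth == 0 && !inString;
    }

    // Collects the non-blank lines that are not complete objects on their own.
    static List<String> offendingLines(String content) {
        List<String> bad = new ArrayList<>();
        for (String line : content.split("\\R")) {
            if (!line.trim().isEmpty() && !isSelfContainedObject(line)) {
                bad.add(line);
            }
        }
        return bad;
    }

    public static void main(String[] args) {
        String multiLine = "{\n  \"name\": \"superman\",\n  \"age\": \"unknown\"\n}";
        String singleLine = "{\"name\": \"superman\", \"age\": \"unknown\"}";
        System.out.println(offendingLines(multiLine).size());  // 4: no line is a full object
        System.out.println(offendingLines(singleLine).size()); // 0: the line is a full object
    }
}
```

In the multi-line case every line, including the lone `{` and `}`, fails the check, which matches the behavior described in the question: the reader sees fragments rather than records.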

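In the past I worked around this by loading the JSON with the Spark context's wholeTextFiles method, which produces a PairRDD of (file path, file content).

See the complete example in the "Spark SQL JSON Example Tutorial Part 2" section of this page: https://www.supergloo.com/fieldnotes/spark-sql-json-examples/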
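The core of that workaround is to treat the whole file as one record instead of one record per line. Below is a plain-Java sketch of that step, under the assumption that the file holds a single valid JSON document; `JsonFlattener`, `toSingleLine`, and `readAsSingleLine` are hypothetical names, and this is not the code from the linked tutorial. It collapses line breaks so that the document fits the one-object-per-line format the reader expects.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

final class JsonFlattener {

    // Collapses every line break, plus the whitespace around it, into a single space.
    // This is safe for valid JSON because string literals cannot contain raw line breaks.
    static String toSingleLine(String json) {
        return json.trim().replaceAll("\\s*\\R\\s*", " ");
    }

    // Reads a whole file as one string (assumed UTF-8) and returns it as a one-line record.
    static String readAsSingleLine(Path file) throws IOException {
        String content = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
        return toSingleLine(content);
    }

    public static void main(String[] args) throws IOException {
        String sample = "{\n  \"name\": \"superman\",\n  \"age\": \"unknown\"\n}";
        // With a path argument, flatten that file; otherwise flatten the inline sample.
        String record = args.length > 0
                ? readAsSingleLine(Paths.get(args[0]))
                : toSingleLine(sample);
        System.out.println(record);
    }
}
```

In a Spark job the same transformation would be applied to the content half of each (path, content) pair from wholeTextFiles before handing the strings to the JSON reader. That wiring is left out here so the sketch runs without a Spark dependency.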

Answer 2

    SparkSession spark = SparkSession.builder().appName("Java Spark Hive Example")
            .config("spark.sql.warehouse.dir", warehouseLocation).enableHiveSupport().getOrCreate();

    // wholeTextFiles yields one (path, content) pair per file, so a multi-line document stays in one record
    JavaRDD<Tuple2<String, String>> javaRDD = spark.sparkContext().wholeTextFiles(filePath, 1).toJavaRDD();

    List<Tuple2<String, String>> collect = javaRDD.collect();
    System.out.println("everything =  " + collect);

Answer 3

The Apache Spark documentation states this clearly:

For a regular multi-line JSON file, set the multiLine option to true.

So the solution is:

Dataset<Row> df = spark.read().option("multiLine", true).json("file:/a/b/c.json");
df.show();

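I tried this with JSON in the same format (a single JSON object spanning multiple lines). After adding the option, I no longer get a result with the corrupted_record header (Spark's `_corrupt_record` column).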

