Spark DataFrame requires JSON file as one object per line?
I'm new to Spark and am trying to use it to read a JSON file, like this. This is Spark 2.3 with Scala 2.11 and Java 1.8 on Ubuntu 18.04:
cat my.json:
{ "Name":"A", "No_Of_Emp":1, "No_Of_Supervisors":2}
{ "Name":"B", "No_Of_Emp":2, "No_Of_Supervisors":3}
{ "Name":"C", "No_Of_Emp":13,"No_Of_Supervisors":6}
My Scala code is:
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

val dir = System.getProperty("user.dir")
val conf = new SparkConf().setAppName("spark sql")
  .set("spark.sql.warehouse.dir", dir)
  .setMaster("local[4]")
val spark = SparkSession.builder().config(conf).getOrCreate()
val df = spark.read.json("my.json")
df.show()
df.printSchema()
df.select("Name").show()
OK, everything works fine. But if I change the JSON file to a multi-line, standard JSON format:
[
  {
    "Name": "A",
    "No_Of_Emp": 1,
    "No_Of_Supervisors": 2
  },
  {
    "Name": "B",
    "No_Of_Emp": 2,
    "No_Of_Supervisors": 3
  },
  {
    "Name": "C",
    "No_Of_Emp": 13,
    "No_Of_Supervisors": 6
  }
]
Then the program reports this error:
+--------------------+
| _corrupt_record|
+--------------------+
| [|
| {|
| "Name": "A",|
| "No_Of_Emp"...|
| "No_Of_Supe...|
| },|
| {|
| "Name": "B",|
| "No_Of_Emp"...|
| "No_Of_Supe...|
| },|
| {|
| "Name": "C",|
| "No_Of_Emp"...|
| "No_Of_Supe...|
| }|
| ]|
+--------------------+
root
|-- _corrupt_record: string (nullable = true)
Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve '`Name`' given input columns: [_corrupt_record];;
'Project ['Name]
+- Relation[_corrupt_record#0] json
I want to know why this happens. The non-standard JSON file (one object per line, without the enclosing []) works, but the more standardized JSON format becomes a "corrupt record"?
The Spark documentation gives us some information about your question:
Spark SQL can automatically infer the schema of a JSON dataset and
load it as a Dataset[Row]. This conversion can be done using
SparkSession.read.json() on either a Dataset[String], or a JSON file.
Note that the file that is offered as a json file is not a typical
JSON file. Each line must contain a separate, self-contained valid
JSON object. For more information, please see JSON Lines text format,
also called newline-delimited JSON. For a regular multi-line JSON
file, set the multiLine option to true.
So if you want to run it with your multi-line data, set the multiLine option to true.
Here is an example:
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

val dir = System.getProperty("user.dir")
val conf = new SparkConf().setAppName("spark sql")
  .set("spark.sql.warehouse.dir", dir)
  .setMaster("local[*]")
val spark = SparkSession.builder().config(conf).getOrCreate()
// multiLine lets Spark parse JSON values that span multiple lines in the file
val df = spark.read.option("multiLine", true).json("my.json")
df.show()
df.printSchema()
df.select("Name").show()