Load spark data frame from a pretty-printed text file
I have a few thousand JSON files, each with content similar to the following:
{
"1" : { "key":"key1", "val":"val1" },
"2" : { "key":"key2", "val":"val2" },
"3" : { "key":"key3", "val":"val3" }
.
.
.
}
What is the correct way to load these files into a Spark DataFrame, so that I end up with:
+------+----------------------------------+
|id    | val                              |
+------+----------------------------------+
|1     | { "key":"key1", "val":"val1" }   |
|2     | { "key":"key2", "val":"val2" }   |
|3     | { "key":"key3", "val":"val3" }   |
+------+----------------------------------+
I tried loading the JSON as multiline:
val df = spark.read.option("multiline", "true").json(small_file)
but the result is a single row with three columns:
+------------------------+------------------------+----------------+
|1 |2 |3 |
+------------------------+------------------------+----------------+
|{ "key":"key1", "val..} |{"key":"key2", "val..} |{"key":"key3"...|
+------------------------+------------------------+----------------+
What I also did was load the file into a Map:
val keys = df.columns
val values = df.collect().last.toSeq
val myMap = keys.zip(values).toMap
println(myMap)
// output
// Map(1-> [key1, val1], 2-> [key2, val2], 3-> [key3, val3])
but I don't know how to create a DataFrame from this Map.
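For what it's worth, a two-column DataFrame can be built directly from that Map (a minimal sketch, assuming the `myMap` built above and an active `SparkSession` named `spark`; note the values from `df.collect()` are `Row`s, so they come out in `Row`'s string form like `[key1,val1]`, not the original JSON text):

```scala
import spark.implicits._

// Turn the Map entries into (id, value) pairs, rendering each Row value
// as a string, and sort by id so the rows come out in order.
val pairs = myMap.toSeq
  .map { case (k, v) => (k, v.toString) }
  .sortBy(_._1)

// toDF on a Seq of tuples yields the desired two-column frame.
val mapDf = pairs.toDF("id", "val")
mapDf.show(false)
```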
This is a multiline JSON file, and you can read such files by specifying the multiline
option like this:
val spark = SparkSession
.builder()
.appName("JSONReader")
.master("local")
.getOrCreate()
val multiline_df = spark.read.option("multiline","true")
.json("multiline-file.json")
multiline_df.show(false)
The result will look like this:
[info] +------------+------------+------------+
[info] |1 |2 |3 |
[info] +------------+------------+------------+
[info] |[key1, val1]|[key2, val2]|[key3, val3]|
[info] +------------+------------+------------+
[info]
I was able to get the result with the following steps:
As mentioned in the question, the resulting df after loading will look like this:
+------------------------+------------------------+----------------+
|1 |2 |3 |
+------------------------+------------------------+----------------+
|{ "key":"key1", "val..} |{"key":"key2", "val..} |{"key":"key3"...|
+------------------------+------------------------+----------------+
1- Cast the columns to string:
val cols = df.columns.map(x => col(s"${x}").cast("string").alias(s"${x}"))
2- Create a string of the column names:
val str_cols = df.columns.mkString(",")
3- Create a new df using the casted values from step 1:
val df1 = df.withColumn("temp",
explode(arrays_zip(array(cols:_*),
split(lit(str_cols),","))))
.select("temp.*")
.toDF("vals","index")
4- The resulting DataFrame will be as desired:
df1.select($"index",$"vals").show()
+------+----------------------------------+
|index | vals |
+------+----------------------------------+
|1 | { "key":"key1", "val":"val1" } |
|2 | { "key":"key2", "val":"val2" } |
|3     | { "key":"key3", "val":"val3" }   |
+------+----------------------------------+
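The four steps above can also be chained into a single expression (a sketch under the same assumptions as the answer, with the imports the snippets rely on spelled out; `small_file` is the path from the question):

```scala
import org.apache.spark.sql.functions.{array, arrays_zip, col, explode, lit, split}

val df = spark.read.option("multiline", "true").json(small_file)

// Cast every column to string, zip each value with its column name,
// then explode the zipped array into one row per original column.
val cols = df.columns.map(c => col(c).cast("string"))
val result = df
  .withColumn("temp",
    explode(arrays_zip(array(cols: _*), split(lit(df.columns.mkString(",")), ","))))
  .select("temp.*")
  .toDF("vals", "index")
  .select("index", "vals")

result.show(false)
```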