Read CSV with last column as array of values (values inside parentheses and separated by commas) in Spark
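I have a CSV file in which the last column is enclosed in parentheses and its values are separated by commas. The number of values in the last column varies. When I read the file into a DataFrame with the column names shown below, I get Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: The number of columns doesn't match. My CSV file looks like this: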
a1,b1,true,2017-05-16T07:00:41.0000000,2.5,(c1,d1,e1)
a2,b2,true,2017-05-26T07:00:42.0000000,0.5,(c2,d2,e2,f2,g2)
a2,b2,true,2017-05-26T07:00:42.0000000,0.5,(c2)
a2,b2,true,2017-05-26T07:00:42.0000000,0.5,(c2,d2)
a2,b2,true,2017-05-26T07:00:42.0000000,0.5,(c2,d2,e2)
a2,b2,true,2017-05-26T07:00:42.0000000,0.5,(c2,d2,e2,k2,f2)
What I ultimately want is this:
root
|-- MId: string (nullable = true)
|-- PId: string (nullable = true)
|-- IsTeacher: boolean(nullable = true)
|-- STime: datetype(nullable = true)
|-- TotalMinutes: double(nullable = true)
|-- SomeArrayHeader: array<string>(nullable = true)
So far I have written the following code:
val infoDF =
  sqlContext.read.format("csv")
    .option("header", "false")
    .load(inputPath)
    .toDF(
      "MId",
      "PId",
      "IsTeacher",
      "STime",
      "TotalMinutes",
      "SomeArrayHeader")
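I wanted to read the file without giving column names and then convert the columns after the fifth into an array type, but then I ran into the problem with the parentheses. Is there a way to do this at read time, telling Spark that the fields inside the parentheses are actually a single field of array type?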
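The mismatch behind the exception can be reproduced without Spark. With default settings every comma is treated as a delimiter, so the parenthesized list is broken into separate fields and the first sample row yields eight values, while toDF is given only six names. A minimal sketch in plain Scala, where firstRow and fields are illustrative names and the naive split only approximates what the CSV parser does:

```scala
// First row of the sample file, copied from the question.
val firstRow = "a1,b1,true,2017-05-16T07:00:41.0000000,2.5,(c1,d1,e1)"

// Naive comma split: the parentheses carry no special meaning here.
val fields = firstRow.split(",")

println(fields.length)                  // 8
println(fields.drop(5).mkString(" | ")) // (c1 | d1 | e1)
```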
OK. This solution only works for your particular case. The following worked for me:
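import org.apache.spark.sql.functions.{regexp_replace, split}
import spark.implicits._ // needed for the 'arr column syntax

// Using "(" as the quote character keeps the parenthesized list in a single field.
val df = spark.read.option("quote", "(").csv("in/staff.csv").toDF(
  "MId",
  "PId",
  "IsTeacher",
  "STime",
  "TotalMinutes",
  "arr")
df.show()
// Remove the trailing ")" and split the remaining string on commas.
val df2 = df.withColumn("arr", split(regexp_replace('arr, "[)]", ""), ","))
df2.printSchema()
df2.show()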
Output:
+---+---+---------+--------------------+------------+---------------+
|MId|PId|IsTeacher| STime|TotalMinutes| arr|
+---+---+---------+--------------------+------------+---------------+
| a1| b1| true|2017-05-16T07:00:...| 2.5| c1,d1,e1)|
| a2| b2| true|2017-05-26T07:00:...| 0.5|c2,d2,e2,f2,g2)|
| a2| b2| true|2017-05-26T07:00:...| 0.5| c2)|
| a2| b2| true|2017-05-26T07:00:...| 0.5| c2,d2)|
| a2| b2| true|2017-05-26T07:00:...| 0.5| c2,d2,e2)|
| a2| b2| true|2017-05-26T07:00:...| 0.5|c2,d2,e2,k2,f2)|
+---+---+---------+--------------------+------------+---------------+
root
|-- MId: string (nullable = true)
|-- PId: string (nullable = true)
|-- IsTeacher: string (nullable = true)
|-- STime: string (nullable = true)
|-- TotalMinutes: string (nullable = true)
|-- arr: array (nullable = true)
| |-- element: string (containsNull = true)
+---+---+---------+--------------------+------------+--------------------+
|MId|PId|IsTeacher| STime|TotalMinutes| arr|
+---+---+---------+--------------------+------------+--------------------+
| a1| b1| true|2017-05-16T07:00:...| 2.5| [c1, d1, e1]|
| a2| b2| true|2017-05-26T07:00:...| 0.5|[c2, d2, e2, f2, g2]|
| a2| b2| true|2017-05-26T07:00:...| 0.5| [c2]|
| a2| b2| true|2017-05-26T07:00:...| 0.5| [c2, d2]|
| a2| b2| true|2017-05-26T07:00:...| 0.5| [c2, d2, e2]|
| a2| b2| true|2017-05-26T07:00:...| 0.5|[c2, d2, e2, k2, f2]|
+---+---+---------+--------------------+------------+--------------------+
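The schema above still has IsTeacher, STime and TotalMinutes as strings, whereas the question asks for typed columns. One possible follow-up is to cast them after the split. The sketch below is not part of the original answer: it builds a small in-memory stand-in for df instead of reading in/staff.csv, and the local session, the app name and the name typed are assumptions for illustration. Whether the timestamp cast accepts a seven-digit fractional second may depend on the Spark version, so that line in particular should be checked against real data.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, regexp_replace, split}

// Local session for illustration only; the app name is arbitrary.
val spark = SparkSession.builder().master("local[*]").appName("cast-sketch").getOrCreate()
import spark.implicits._

// Stand-in for df from the answer: every column is still a string and the
// last one keeps its closing parenthesis.
val df = Seq(
  ("a1", "b1", "true", "2017-05-16T07:00:41.0000000", "2.5", "c1,d1,e1)"),
  ("a2", "b2", "true", "2017-05-26T07:00:42.0000000", "0.5", "c2)")
).toDF("MId", "PId", "IsTeacher", "STime", "TotalMinutes", "arr")

val typed = df
  // Same clean-up as in the answer: drop ")" and split on commas.
  .withColumn("arr", split(regexp_replace(col("arr"), "[)]", ""), ","))
  .withColumn("IsTeacher", col("IsTeacher").cast("boolean"))
  .withColumn("TotalMinutes", col("TotalMinutes").cast("double"))
  // Assumption: the cast can parse this timestamp layout. Depending on the
  // Spark version and ANSI settings, unparseable values become null or raise an error.
  .withColumn("STime", col("STime").cast("timestamp"))

typed.printSchema()
```

Casting after the read keeps the quote trick from the answer unchanged; supplying an explicit schema to the reader would be another option, but the array column would still have to be built from a string.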