What's the delimiter equivalent of ^G when reading a CSV with spark?
So, I really need help with a stupid thing, but apparently I can't manage it on my own.
I have a file with a set of rows in this format (read with less on OSX):
XXXXXXXX^GT^XXXXXXXX^G\N^G0^GDL^G\N^G2018-09-14 13:57:00.0^G2018-09-16 00:00:00.0^GCompleted^G\N^G\N^G1^G2018-09-16 21:41:02.267^G1^G2018-09-16 21:41:02.267^GXXXXXXX^G\N
YYYYYYYY^GS^XXXXXXXX^G\N^G0^GDL^G\N^G2018-08-29 00:00:00.0^G2018-08-29 23:00:00.0^GCompleted^G\N^G\N^G1^G2018-09-16 21:41:03.797^G1^G2018-09-16 21:41:03.81^GXXXXXXX^G\N
So the delimiter is the BEL character, and I load the CSV this way:
val df = sqlContext.read.format("csv")
  .option("header", "false")
  .option("inferSchema", "true")
  .option("delimiter", "\u2407")
  .option("nullValue", "\\N")
  .load("part0000")
But when I read it back, it just reads each row as a single column, like this:
XXXXXXXXCXXXXXXXX\N0DL\N2018-09-15 00:00:00.02018-09-16 00:00:00.0Completed\N\N12018-09-16 21:41:03.25712018-09-16 21:41:03.263XXXXXXXX\N
XXXXXXXXSXXXXXXXX\N0DL\N2018-09-15 00:00:00.02018-09-15 23:00:00.0Completed\N\N12018-09-16 21:41:03.3712018-09-16 21:41:03.373XXXXXXXX\N
It seems there is an unknown character in place of the ^G (you can't see anything because I had to format it for Stack Overflow).
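A side note that may explain that "unknown character": "\u2407" is not BEL. U+2407 is "␇" (SYMBOL FOR BELL), a printable picture of the control character, whereas the character that less shows as ^G is the control character U+0007. A minimal sketch of the same load with the real control character, keeping the question's file name and options:

val df = sqlContext.read.format("csv")
  .option("header", "false")
  .option("inferSchema", "true")
  .option("delimiter", "\u0007") // the actual BEL / ^G control character
  .option("nullValue", "\\N")
  .load("part0000")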
UPDATE:
Could this be a limitation of Spark with Scala?
If I run the code this way with Scala:
val df = sqlContext.read.format("csv")
  .option("header", "false")
  .option("inferSchema", "true")
  .option("delimiter", "\\a")
  .option("nullValue", "\\N")
  .load("part-m-00000")
display(df)
I get:
java.lang.IllegalArgumentException: Unsupported special character for delimiter: \a
Whereas if I run it with Python:
df = sqlContext.read.format('csv').options(header='false', inferSchema='true', delimiter = "\a", nullValue = '\N').load('part-m-00000')
display(df)
Everything works fine!
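The difference is most likely string escapes, not Spark itself: Python recognizes "\a" as an escape and produces the actual BEL character (U+0007), so Spark receives a one-character delimiter, while Scala has no \a escape, so the literal "\\a" sends the two-character string \a, which Spark must then interpret on its own and rejects. A quick illustration of what each side actually passes (plain Scala, nothing Spark-specific):

// Scala has no "\a" escape; BEL must be written as a unicode escape.
val bel = "\u0007"
println(bel.length)      // 1 -> a single-character delimiter, which Spark accepts
println("\\a".length)    // 2 -> backslash + 'a', which Spark has to parse and rejects
println("\u2407" == bel) // false -> the "␇" glyph is not the BEL control character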
It looks like a limitation of these Spark versions on the Scala side; this is where the supported CSV delimiters are resolved in the code:
apache/spark/sql/catalyst/csv/CSVOptions.scala
val delimiter = CSVExprUtils.toChar(
  parameters.getOrElse("sep", parameters.getOrElse("delimiter", ",")))
--- CSVExprUtils.toChar
apache/spark/sql/catalyst/csv/CSVExprUtils.scala
def toChar(str: String): Char = {
  (str: Seq[Char]) match {
    case Seq() => throw new IllegalArgumentException("Delimiter cannot be empty string")
    case Seq('\\') => throw new IllegalArgumentException("Single backslash is prohibited." +
      " It has special meaning as beginning of an escape sequence." +
      " To get the backslash character, pass a string with two backslashes as the delimiter.")
    case Seq(c) => c
    case Seq('\\', 't') => '\t'
    case Seq('\\', 'r') => '\r'
    case Seq('\\', 'b') => '\b'
    case Seq('\\', 'f') => '\f'
    // In case user changes quote char and uses \" as delimiter in options
    case Seq('\\', '\"') => '\"'
    case Seq('\\', '\'') => '\''
    case Seq('\\', '\\') => '\\'
    case _ if str == """\u0000""" => '\u0000'
    case Seq('\\', _) =>
      throw new IllegalArgumentException(s"Unsupported special character for delimiter: $str")
    case _ =>
      throw new IllegalArgumentException(s"Delimiter cannot be more than one character: $str")
  }
}
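Tracing those match arms explains both outcomes: any one-character string, including a real BEL, is accepted by case Seq(c), while a two-character string beginning with a backslash is accepted only for the escapes explicitly listed, and \a is not among them, so it falls through to the "Unsupported special character" arm. A sketch of the resulting behavior (note CSVExprUtils sits in Spark's internal catalyst package, so these calls are illustrative rather than public API):

CSVExprUtils.toChar(",")      // ','  -> case Seq(c): any single character works
CSVExprUtils.toChar("\u0007") // BEL  -> case Seq(c): one char, so BEL itself is fine
CSVExprUtils.toChar("\\t")    // two chars "\t" -> explicitly mapped to the tab char
CSVExprUtils.toChar("\\a")    // two chars "\a" -> IllegalArgumentException:
                              //   "Unsupported special character for delimiter: \a"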