Scala - how to pass delimiter as a variable when writing dataframe as csv
Using a variable as the delimiter for dataframe.write.csv does not work, and the alternatives I have tried are too convoluted.
val df = Seq(("a", "b", "c"), ("a1", "b1", "c1")).toDF("A", "B", "C")
val delim_char = "\u001F"
df.coalesce(1).write.option("delimiter", delim_char).csv("file:///var/tmp/test") // Does not work -- error related to too many chars
df.coalesce(1).write.option("delimiter", "\u001F").csv("file:///var/tmp/test") //works fine...
I have tried .toHexString and many other alternatives...
Your statement works perfectly fine. It works in both cases, whether you provide the string value directly or pass it through a variable. The character-length error only appears if you enclose the delimiter value in single quotes, as '\u001F'. It is not related to Scala 2.11.8. Here is the spark-shell session:
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://xx.x.xxx.xx:xxxx
Spark context available as 'sc' (master = local[*], app id = local-1535083313716).
Spark session available as 'spark'.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 2.2.0.2.6.3.0-235
/_/
Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_144)
Type in expressions to have them evaluated.
Type :help for more information.
scala> import java.io.File
import java.io.File
scala> import org.apache.spark.sql.{Row, SaveMode, SparkSession}
import org.apache.spark.sql.{Row, SaveMode, SparkSession}
scala> val warehouseLocation = new File("spark-warehouse").getAbsolutePath
warehouseLocation: String = /usr/hdp/2.6.3.0-235/spark2/spark-warehouse
scala> val spark = SparkSession.builder().appName("app").config("spark.sql.warehouse.dir", warehouseLocation).enableHiveSupport().getOrCreate()
18/08/24 00:02:25 WARN SparkSession$Builder: Using an existing SparkSession; some configuration may not take effect.
spark: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@37d3e740
scala> import spark.implicits._
import spark.implicits._
scala> import spark.sql
import spark.sql
scala> val df = Seq(("a", "b", "c"), ("a1", "b1", "c1")).toDF("A", "B", "C")
df: org.apache.spark.sql.DataFrame = [A: string, B: string ... 1 more field]
scala> val delim_char = "\u001F"
delim_char: String = ""
scala> df.coalesce(1).write.option("delimiter", delim_char).csv("file:///var/tmp/test")
scala>
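If you want to double-check the output, you can read the file back with the same delimiter -- a quick sanity check, assuming the same path as above:
val readBack = spark.read.option("delimiter", delim_char).csv("file:///var/tmp/test")
readBack.show()  // rows come back split into three columns (_c0, _c1, _c2, since no header was written)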
Thank you for your help.
The code above works when tested, but I couldn't find a way to show here how the problem arose. The issue, however, was that after collecting from a csv file, a variable had been assigned a string (the Unicode "\u001F"; println showed the result as the string: \u001F).
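The difference is easy to see by comparing string lengths -- a minimal sketch (the variable names are only illustrative):
val realDelim = "\u001F"       // escape resolved at compile time: a 1-character string holding the unit separator
val literalDelim = "\\u001F"   // the six literal characters \ u 0 0 1 F -- what you get when the escape is read back from a file
println(realDelim.length)      // 1
println(literalDelim.length)   // 6
// Passing literalDelim as the "delimiter" option fails, because the csv writer expects a single character.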
Tried several approaches. Finally found the solution in another ...
1) Did not work -- delim_char.format("unicode-escape")
2) Worked --
// Replace literal \uXXXX escape sequences in the input with the actual characters they denote
def unescapeUnicode(str: String): String =
  """\\u([0-9a-fA-F]{4})""".r.replaceAllIn(str,
    m => Integer.parseInt(m.group(1), 16).toChar.toString)
unescapeUnicode(delim_char)
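For completeness, here is how the fix slots into the original write -- a sketch, assuming the delimiter was collected from a file and therefore still holds the literal escape sequence (rawDelim / delimChar are illustrative names):
val rawDelim = "\\u001F"                  // as collected from the csv file: the six characters \ u 0 0 1 F
val delimChar = unescapeUnicode(rawDelim) // now a single-character string containing the unit separator
df.coalesce(1).write.option("delimiter", delimChar).csv("file:///var/tmp/test")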