Scala - how to pass delimiter as a variable when writing dataframe as csv

Using a variable as the delimiter for dataframe.write.csv does not work, and the alternatives I have tried are too convoluted.

 val df = Seq(("a", "b", "c"), ("a1", "b1", "c1")).toDF("A", "B", "C")
 val delim_char = "\u001F"

 df.coalesce(1).write.option("delimiter", delim_char).csv("file:///var/tmp/test")  // Does not work -- error related to too many chars
 df.coalesce(1).write.option("delimiter", "\u001F").csv("file:///var/tmp/test")  //works fine...

I have tried .toHexString and many other alternatives...

Your statement works perfectly well. It works both when you supply the string value directly and when you pass it as a variable reference. The character-length error appears only when you enclose the delimiter value in single quotes, '\u001F'. It has nothing to do with Scala 2.11.8.

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://xx.x.xxx.xx:xxxx
Spark context available as 'sc' (master = local[*], app id = local-1535083313716).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.2.0.2.6.3.0-235
      /_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_144)
Type in expressions to have them evaluated.
Type :help for more information.

scala> import java.io.File
import java.io.File

scala> import org.apache.spark.sql.{Row, SaveMode, SparkSession}
import org.apache.spark.sql.{Row, SaveMode, SparkSession}

scala> val warehouseLocation = new File("spark-warehouse").getAbsolutePath
warehouseLocation: String = /usr/hdp/2.6.3.0-235/spark2/spark-warehouse

scala> val spark = SparkSession.builder().appName("app").config("spark.sql.warehouse.dir", warehouseLocation).enableHiveSupport().getOrCreate()
18/08/24 00:02:25 WARN SparkSession$Builder: Using an existing SparkSession; some configuration may not take effect.
spark: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@37d3e740

scala> import spark.implicits._
import spark.implicits._

scala> import spark.sql
import spark.sql

scala> val df = Seq(("a", "b", "c"), ("a1", "b1", "c1")).toDF("A", "B", "C")
df: org.apache.spark.sql.DataFrame = [A: string, B: string ... 1 more field]

scala> val delim_char = "\u001F"
delim_char: String = ""

scala> df.coalesce(1).write.option("delimiter", delim_char).csv("file:///var/tmp/test")

scala>
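
When the variable does fail, a quick way to see why (my own diagnostic sketch, not part of the original answer) is to dump the code points the string actually contains:

 // Print one code point per character so an escaped literal is easy to
 // tell apart from the real control character.
 delim_char.map(c => f"U+${c.toInt}%04X").foreach(println)
 // A working value prints a single line:            U+001F
 // A value holding the literal text prints six lines: U+005C U+0075 U+0030 U+0030 U+0031 U+0046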

Thank you for the help.

The code above works when tested, but I could not find a way to show how the problem actually arises. The issue, however, is that the variable ends up assigned a string collected from a csv file (the Unicode "\u001F"; println shows the result as the literal string \u001F).
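
One hypothetical way to reproduce that state (the file path and layout below are my assumptions, not from the post) is to read the delimiter back from a one-column csv, which yields the six literal characters rather than the single control character:

 // Hypothetical set-up: the delimiter stored as plain text in a config csv.
 val delimFromFile = spark.read.csv("file:///var/tmp/delim_config.csv")
                          .collect()(0).getString(0)
 println(delimFromFile)           // prints the six characters: \u001F
 println(delimFromFile.length)    // 6, not 1

 // Passing it straight to the writer therefore trips the single-character check:
 // df.coalesce(1).write.option("delimiter", delimFromFile).csv("file:///var/tmp/test")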

I tried several approaches, and finally found the solution in another ...

1) Did not work -- delim_char.format("unicode-escape")

2) Worked --

// Replace every literal "\uXXXX" escape sequence in a string with the
// single character it denotes (so the six characters \u001F become U+001F).
def unescapeUnicode(str: String): String =
  """\\u([0-9a-fA-F]{4})""".r.replaceAllIn(str,
    m => Integer.parseInt(m.group(1), 16).toChar.toString)

unescapeUnicode(delim_char)
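
Applied to the variable from the failing case, the write then goes through (a short usage sketch; delim_char is assumed here to hold the six literal characters):

 val resolved = unescapeUnicode(delim_char)   // literal text \u001F -> the single U+001F character
 println(resolved.length)                     // 1

 df.coalesce(1).write.option("delimiter", resolved).csv("file:///var/tmp/test")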