Reading encoded value in Spark 1.6 throwing error

Question

I am receiving a file from an API that has an encoded (non-ASCII) character value in 3 columns. When I read the file using a DataFrame in Spark 1.6

val CleanData= sqlContext.sql("""SELECT
                                               COL1,
                                               COL2,
                                               COL3
                                               FROM CLEANFRAME
                                               """ )

The encoded value is shown below.

But the encoded value looks like:

53004, ��

Could someone please help me resolve this error, if possible using Spark 1.6 and Scala?

Answer 1

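// This can be achieved with regexp_replace, which removes every character that is not an ASCII letter.
    import org.apache.spark.sql.functions.regexp_replace
    import sqlContext.implicits._  // Spark 1.6 shell; on Spark 2.x use spark.implicits._ and spark.sparkContext
    val df = sc.parallelize(List(("503004","d$üíõ$F|'.h*Ë!øì=(.î;      ,.¡|®!®","3-2-704"))).toDF("col1","col2","col3")
    df.withColumn("col2_new", regexp_replace($"col2", "[^a-zA-Z]", "")).show()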
Output:
+------+--------------------+-------+--------+
|  col1|                col2|   col3|col2_new|
+------+--------------------+-------+--------+
|503004|d$üíõ$F|'.h*Ë!øì=...|3-2-704|     dFh|
+------+--------------------+-------+--------+
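
To see what the pattern keeps and drops without starting Spark, the same regex can be tried on a plain Scala string. This is only a sketch: `StripNonLetters` and `strip` are hypothetical names made up for illustration, and `String.replaceAll` stands in for `regexp_replace` on the assumption that both apply the pattern the same way for this input.

```scala
object StripNonLetters {
  // Same pattern as in the answer: matches anything that is not an ASCII letter.
  val NonLetters = "[^a-zA-Z]"

  // Hypothetical helper (not part of Spark): mirrors regexp_replace(col, pattern, "").
  def strip(value: String): String = value.replaceAll(NonLetters, "")

  def main(args: Array[String]): Unit = {
    val raw = "d$üíõ$F|'.h*Ë!øì=(.î;      ,.¡|®!®"
    println(strip(raw))        // prints "dFh"
    println(strip("3-2-704"))  // prints an empty line: digits and hyphens are removed too
  }
}
```

Note that this pattern also strips digits, spaces and punctuation, so it is only suitable when the column is expected to hold letters; a wider character class such as `[^a-zA-Z0-9 ]` would be needed to keep digits and spaces.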

scala

apache-spark

apache-spark-sql

apache-spark-1.6