Scala: Converting a hexadecimal substring of a column to decimal - Dataframe org.apache.spark.sql.catalyst.parser.ParseException
val DF = Seq("310:120:fe5ab02").toDF("id")
+-----------------+
| id              |
+-----------------+
| 310:120:fe5ab02 |
+-----------------+
Expected output:
+-----------------+-------------+--------+
| id              | id1         | id2    |
+-----------------+-------------+--------+
| 310:120:fe5ab02 | 2           | 1041835|
+-----------------+-------------+--------+
I need to convert two substrings of the string in this column from hexadecimal to decimal and add them as two new columns of the Dataframe:
id1 -> 310:120:fe5ab02 -> x.split(":")(2) -> fe5ab02 -> substring(5) -> 02 -> Integer.parseInt(x, 16) -> 2
id2 -> 310:120:fe5ab02 -> x.split(":")(2) -> fe5ab02 -> substring(0, 5) -> fe5ab -> Integer.parseInt(x, 16) -> 1041835
From "310:120:fe5ab02" I need "fe5ab02", which I get with x.split(":")(2). From that I need the two substrings "fe5ab" and "02", which I get with x.substring(0, 5) and x.substring(5). Then I convert each of them to decimal with Integer.parseInt(x, 16).
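In plain Scala the whole chain can be verified on a single value (a quick sketch of the steps above, using the sample id):
val hex = "310:120:fe5ab02".split(":")(2)           // "fe5ab02"
val id1 = Integer.parseInt(hex.substring(5), 16)    // 2
val id2 = Integer.parseInt(hex.substring(0, 5), 16) // 1041835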
Each of these works fine on its own, but I need to use them inside a single withColumn call, like this:
val DF1 = DF
.withColumn("id1", expr("""Integer.parseInt((id.split(":")(2)).substring(5), 16)"""))
.withColumn("id2", expr("""Integer.parseInt((id.split(":")(2)).substring(0, 5), 16)"""))
display(DF1)
I am getting a parse exception.
expr() takes a Spark SQL expression, not Scala code, so JVM calls like Integer.parseInt cannot appear inside it; that is what triggers the ParseException. One option is to move the logic into a UDF:
case class SplitId(part1: Int, part2: Int)
// Take the third ':'-separated token and convert both hex substrings to Int
def splitHex: String => SplitId = s => {
  val str = s.split(":")(2)
  SplitId(Integer.parseInt(str.substring(5), 16), Integer.parseInt(str.substring(0, 5), 16))
}
import org.apache.spark.sql.functions.udf
val splitHexUDF = udf(splitHex)
df.withColumn("splitId", splitHexUDF(df("id")))
  .withColumn("id1", $"splitId.part1")
  .withColumn("id2", $"splitId.part2")
  .drop($"splitId")
  .show()
+---------------+---+-------+
| id|id1| id2|
+---------------+---+-------+
|310:120:fe5ab02| 2|1041835|
+---------------+---+-------+
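Because the UDF returns a case class, Spark maps splitId to a struct column, which is why the nested fields can be read back with $"splitId.part1" and $"splitId.part2".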
Alternatively, you can use the following snippet, which avoids the UDF:
import org.apache.spark.sql.functions._
val df2 = df.withColumn("splitId", split($"id", ":")(2))
  // conv(str, 16, 10) converts the hex string to its decimal value; casting the
  // raw substring straight to int would only work while it contains no a-f digits
  .withColumn("id1", conv($"splitId".substr(lit(6), length($"splitId") - 1), 16, 10).cast("int"))
  .withColumn("id2", conv(substring($"splitId", 1, 5), 16, 10).cast("int"))
  .drop($"splitId")
df2.printSchema
root
|-- id: string (nullable = true)
|-- id1: integer (nullable = true)
|-- id2: integer (nullable = true)
df2.show()
+---------------+---+-------+
| id|id1| id2|
+---------------+---+-------+
|310:120:fe5ab02| 2|1041835|
+---------------+---+-------+
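If you would rather keep everything inside a single expr(), as in the original attempt, note that expr() only understands Spark SQL functions, and conv, split and substring are all available there. A minimal sketch, equivalent to df2 above (df3 is just an illustrative name, and the fixed offsets assume the third token is 7 characters long, as in the sample):
import org.apache.spark.sql.functions.expr
val df3 = df
  .withColumn("id1", expr("cast(conv(substring(split(id, ':')[2], 6, 2), 16, 10) as int)"))
  .withColumn("id2", expr("cast(conv(substring(split(id, ':')[2], 1, 5), 16, 10) as int)"))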