Scala: Converting a hexadecimal substring of a column to decimal - Dataframe org.apache.spark.sql.catalyst.parser.ParseException
val DF = Seq("310:120:fe5ab02").toDF("id")
+-----------------+
| id              |
+-----------------+
| 310:120:fe5ab02 |
+-----------------+
Expected output:
+-----------------+-------------+--------+
| id              | id1         | id2    |
+-----------------+-------------+--------+
| 310:120:fe5ab02 | 2           | 1041835|
+-----------------+-------------+--------+
I need to convert two substrings of the string in this column from hexadecimal to decimal and add them as two new columns of the Dataframe:
id1 -> 310:120:fe5ab02 -> x.split(":")(2) -> fe5ab02 -> substring(5) -> 02 -> Integer.parseInt(x, 16) -> 2
id2 -> 310:120:fe5ab02 -> x.split(":")(2) -> fe5ab02 -> substring(0, 5) -> fe5ab -> Integer.parseInt(x, 16) -> 1041835
From "310:120:fe5ab02" I need "fe5ab02", which I get with x.split(":")(2). From that I need the two substrings "fe5ab" and "02", which I get with x.substring(0, 5) and x.substring(5). Then I convert each of them to decimal with Integer.parseInt(x, 16).
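In plain Scala the whole chain can be verified on a single value (a quick sketch of the steps above, using the sample id):
val hex = "310:120:fe5ab02".split(":")(2)           // "fe5ab02"
val id1 = Integer.parseInt(hex.substring(5), 16)    // 2
val id2 = Integer.parseInt(hex.substring(0, 5), 16) // 1041835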
Each of these works fine on its own, but I need to use them inside a single withColumn call, like this:
val DF1 = DF
.withColumn("id1", expr("""Integer.parseInt((id.split(":")(2)).substring(5), 16)"""))
.withColumn("id2", expr("""Integer.parseInt((id.split(":")(2)).substring(0, 5), 16)"""))
display(DF1)
I am getting a parse exception.
expr() takes a Spark SQL expression, not Scala code, so JVM calls like Integer.parseInt cannot appear inside it; that is what triggers the ParseException. One option is to move the logic into a UDF:
case class SplitId(part1: Int, part2: Int)
// Take the third ':'-separated token and convert both hex substrings to Int
def splitHex: String => SplitId = s => {
  val str = s.split(":")(2)
  SplitId(Integer.parseInt(str.substring(5), 16), Integer.parseInt(str.substring(0, 5), 16))
}
import org.apache.spark.sql.functions.udf
val splitHexUDF = udf(splitHex)
df.withColumn("splitId", splitHexUDF(df("id")))
  .withColumn("id1", $"splitId.part1")
  .withColumn("id2", $"splitId.part2")
  .drop($"splitId")
  .show()
+---------------+---+-------+
| id|id1| id2|
+---------------+---+-------+
|310:120:fe5ab02| 2|1041835|
+---------------+---+-------+
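Because the UDF returns a case class, Spark maps splitId to a struct column, which is why the nested fields can be read back with $"splitId.part1" and $"splitId.part2".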
Alternatively, you can use the following snippet, which avoids the UDF:
import org.apache.spark.sql.functions._
val df2 = df.withColumn("splitId", split($"id", ":")(2))
  // conv(str, 16, 10) converts the hex string to its decimal value; casting the
  // raw substring straight to int would only work while it contains no a-f digits
  .withColumn("id1", conv($"splitId".substr(lit(6), length($"splitId") - 1), 16, 10).cast("int"))
  .withColumn("id2", conv(substring($"splitId", 1, 5), 16, 10).cast("int"))
  .drop($"splitId")
df2.printSchema
root
|-- id: string (nullable = true)
|-- id1: integer (nullable = true)
|-- id2: integer (nullable = true)
df2.show()
+---------------+---+-------+
| id|id1| id2|
+---------------+---+-------+
|310:120:fe5ab02| 2|1041835|
+---------------+---+-------+
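If you would rather keep everything inside a single expr(), as in the original attempt, note that expr() only understands Spark SQL functions, and conv, split and substring are all available there. A minimal sketch, equivalent to df2 above (df3 is just an illustrative name, and the fixed offsets assume the third token is 7 characters long, as in the sample):
import org.apache.spark.sql.functions.expr
val df3 = df
  .withColumn("id1", expr("cast(conv(substring(split(id, ':')[2], 6, 2), 16, 10) as int)"))
  .withColumn("id2", expr("cast(conv(substring(split(id, ':')[2], 1, 5), 16, 10) as int)"))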