Convert scientific notation in string format to numeric in a Spark DataFrame
Day_Date,timeofday_desc,Timeofday_hour,Timeofday_minute,Timeofday_second,value
2017-12-18,12:21:02 AM,0,21,2,“1.779209040E+08”
2017-12-19,12:21:02 AM,0,21,2,“1.779209040E+08”
2017-12-20,12:30:52 AM,0,30,52,“1.779209040E+08”
2017-12-21,12:30:52 AM,0,30,52,“1.779209040E+08”
2017-12-22,12:47:10 AM,0,47,10,“1.779209040E+08”
2017-12-23,12:47:10 AM,0,47,10,“1.779209040E+08”
2017-12-24,02:46:59 AM,2,46,59,“1.779209040E+08”
2017-12-25,02:46:59 AM,2,46,59,“1.779209040E+08”
2017-12-26,03:10:27 AM,3,10,27,“1.779209040E+08”
2017-12-27,03:10:27 AM,3,10,27,“1.779209040E+08”
2017-12-28,03:52:08 AM,3,52,8,“1.779209040E+08”
I am trying to convert the value column to 177920904
val df1 = df.withColumn("s", 'value.cast("Decimal(10,4)")).drop("value").withColumnRenamed("s", "value")
I also tried casting the value to Float and Double. I always get null as output
df1.select("value").show()
+-----------+
| value |
+-----------+
| null|
| null|
| null|
| null|
| null|
| null|
| null|
| null|
df.printSchema
root
|-- Day_Date: string (nullable = true)
|-- timeofday_desc: string (nullable = true)
|-- Timeofday_hour: string (nullable = true)
|-- Timeofday_minute: string (nullable = true)
|-- Timeofday_second: string (nullable = true)
|-- value: string (nullable = true)
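You just need to cast it to a decimal with enough room to hold the number.
Decimal is Decimal(precision, scale), so Decimal(10, 4) means 10 digits in total: 6 to the left of the decimal point and 4 to the right. 177920904 has 9 integer digits, so it does not fit in your Decimal type.
From the documentation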
precision represents the total number of digits that can be
represented
scale represents the number of fractional digits. This value must be
less than or equal to precision. A scale of 0 produces integral
values, with no fractional part
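To make the arithmetic concrete, here is a small sketch in plain Scala (no Spark needed). The object name DecimalFit and the helper fits are made up for illustration; the sketch only compares digit counts and does not call Spark's cast.

```scala
import java.math.BigDecimal

object DecimalFit {
  // Parse the scientific-notation string as it should look once it is clean.
  val parsed = new BigDecimal("1.779209040E+08")

  // Digits to the left of the decimal point in the parsed value.
  val integerDigits: Int = parsed.precision - parsed.scale

  // A Decimal(p, s) type leaves p - s digits for the integer part.
  def fits(p: Int, s: Int): Boolean = integerDigits <= p - s
}
```

With this, DecimalFit.fits(10, 4) is false, while DecimalFit.fits(10, 0) and DecimalFit.fits(14, 4) are both true.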
Since you don't want any digits to the right of the decimal point, you can try this
df.withColumn("s", 'value.cast("Decimal(10,0)"))
If you want to keep 4 decimal places, change it to
df.withColumn("s", 'value.cast("Decimal(14,4)"))
Input
df.show
+---------------+
| value|
+---------------+
|1.779209040E+08|
+---------------+
Output
scala> df.withColumn("s", 'value.cast("Decimal(10,0)")).show
+---------------+---------+
| value| s|
+---------------+---------+
|1.779209040E+08|177920904|
+---------------+---------+
Complete solution
Without dropping or renaming the column
val df1 = df.withColumn("value", 'value.cast("Decimal(10,0)"))
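Fixing the input data
As I said in the comments, the problem is that your numbers are surrounded by some strange characters, which you should remove before casting
Original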
scala> df.show
+----------+--------------+--------------+----------------+----------------+-----------------+
| Day_Date|timeofday_desc|Timeofday_hour|Timeofday_minute|Timeofday_second| value|
+----------+--------------+--------------+----------------+----------------+-----------------+
|2017-12-18| 12:21:02 AM| 0| 21| 2| ?1.779209040E+08|
|2017-12-19| 12:21:02 AM| 0| 21| 2|?1.779209040E+08?|
|2017-12-20| 12:30:52 AM| 0| 30| 52| ?1.779209040E+08|
|2017-12-21| 12:30:52 AM| 0| 30| 52| ?1.779209040E+08|
|2017-12-22| 12:47:10 AM| 0| 47| 10| ?1.779209040E+08|
|2017-12-23| 12:47:10 AM| 0| 47| 10| ?1.779209040E+08|
|2017-12-24| 02:46:59 AM| 2| 46| 59| ?1.779209040E+08|
|2017-12-25| 02:46:59 AM| 2| 46| 59| ?1.779209040E+08|
|2017-12-26| 03:10:27 AM| 3| 10| 27| ?1.779209040E+08|
|2017-12-27| 03:10:27 AM| 3| 10| 27| ?1.779209040E+08|
|2017-12-28| 03:52:08 AM| 3| 52| 8| ?1.779209040E+08|
+----------+--------------+--------------+----------------+----------------+-----------------+
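A quick way to see what those ? marks might be, assuming they are the curly quotes (“ and ”) visible in the CSV from the question. This is only a sketch in plain Scala; the object name StrayChars and its members are hypothetical, and your file may contain different characters.

```scala
object StrayChars {
  // One raw field as it appears in the CSV, curly quotes included.
  val raw = "\u201C1.779209040E+08\u201D"

  // Characters outside printable ASCII, paired with their Unicode code points.
  val suspicious: Seq[(Char, String)] =
    raw.filter(c => c < ' ' || c > '~').map(c => (c, f"U+${c.toInt}%04X"))

  // True when the string parses as a number.
  def parses(s: String): Boolean =
    scala.util.Try(new java.math.BigDecimal(s)).isSuccess
}
```

Here StrayChars.parses(StrayChars.raw) is false, while the same text without the quotes parses fine, which is consistent with the cast returning null.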
There are many ways to remove them. A quick one is to use a UDF with a regular expression that removes everything except digits, letters, dots, + and -
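import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.types.DecimalType
// Inside a character class + and . are literal; "\+" would not compile in a plain Scala string
def clean(input: String) = input.replaceAll("[^a-zA-Z0-9+.-]", "")
val cleanUDF = udf(clean _)
df.withColumn("value", cleanUDF($"value").cast(DecimalType(10,0))).show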
+----------+--------------+--------------+----------------+----------------+---------+
| Day_Date|timeofday_desc|Timeofday_hour|Timeofday_minute|Timeofday_second| value|
+----------+--------------+--------------+----------------+----------------+---------+
|2017-12-18| 12:21:02 AM| 0| 21| 2|177920904|
|2017-12-19| 12:21:02 AM| 0| 21| 2|177920904|
|2017-12-20| 12:30:52 AM| 0| 30| 52|177920904|
|2017-12-21| 12:30:52 AM| 0| 30| 52|177920904|
|2017-12-22| 12:47:10 AM| 0| 47| 10|177920904|
|2017-12-23| 12:47:10 AM| 0| 47| 10|177920904|
|2017-12-24| 02:46:59 AM| 2| 46| 59|177920904|
|2017-12-25| 02:46:59 AM| 2| 46| 59|177920904|
|2017-12-26| 03:10:27 AM| 3| 10| 27|177920904|
|2017-12-27| 03:10:27 AM| 3| 10| 27|177920904|
|2017-12-28| 03:52:08 AM| 3| 52| 8|177920904|
+----------+--------------+--------------+----------------+----------------+---------+
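If you want to check the cleaning step without starting Spark, something like the following should do. It is a minimal sketch in plain Scala: the object name CleanCheck is made up, and the sample string assumes the stray characters are the curly quotes from the CSV. The regex is written without backslashes, since + and . are literal inside a character class.

```scala
object CleanCheck {
  // Keep only letters, digits, '+', '.' and '-'; drop everything else.
  def clean(input: String): String = input.replaceAll("[^a-zA-Z0-9+.-]", "")

  // One raw field as it appears in the CSV, curly quotes included.
  val raw = "\u201C1.779209040E+08\u201D"
  val cleaned: String = clean(raw)

  // Exact integer value of the cleaned string; throws if a fraction remains.
  val asInteger: java.math.BigInteger =
    new java.math.BigDecimal(cleaned).toBigIntegerExact
}
```

CleanCheck.cleaned is "1.779209040E+08" and CleanCheck.asInteger is 177920904. Inside Spark, the built-in regexp_replace function from org.apache.spark.sql.functions could probably replace the UDF with the same pattern, which also avoids having to handle null inputs yourself.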