spark sql 字符串到时间戳缺失毫秒数
spark sql string to timestamp missing milliseconds
为什么是:
import spark.implicits._
val content = Seq(("2019", "09", "11","17","16","54","762000000")).toDF("year", "month", "day", "hour", "minute", "second", "nano")
content.printSchema
content.show
content.withColumn("event_time_utc", to_timestamp(concat('year, 'month, 'day, 'hour, 'minute, 'second), "yyyyMMddHHmmss"))
.withColumn("event_time_utc_millis", to_timestamp(concat('year, 'month, 'day, 'hour, 'minute, 'second, substring('nano, 0, 3)), "yyyyMMddHHmmssSSS"))
.select('year, 'month, 'day, 'hour, 'minute, 'second, 'nano,substring('nano, 0, 3), 'event_time_utc, 'event_time_utc_millis)
.show
缺少毫秒?
+----+-----+---+----+------+------+---------+---------------------+-------------------+---------------------+
|year|month|day|hour|minute|second| nano|substring(nano, 0, 3)| event_time_utc|event_time_utc_millis|
+----+-----+---+----+------+------+---------+---------------------+-------------------+---------------------+
|2019| 09| 11| 17| 16| 54|762000000| 762|2019-09-11 17:16:54| 2019-09-11 17:16:54|
+----+-----+---+----+------+------+---------+---------------------+-------------------+---------------------+
对于格式字符串:yyyyMMddHHmmssSSS
如果我没记错的话,它应该包括 SSS
中的毫秒数。
尝试在此标准中进行连接:"yyyy-MM-dd HH:mm:ss.ssss"(忽略零,例如:“762000000”,因为 nano/milliseconds 变为“762”)
youDataframe
.withColumn("dateTime_complete",
concat_ws(" ", concat_ws("-", col("year"), col("month"), col("day")),
concat_ws(":", col("hour"), col("minute"), concat_ws(".", col("second"), col("nano")))))
.withColumn("your_new_column", to_utc_timestamp(col("dateTime_complete"), "yyyy-MM-dd HH:mm:ss.sss"))
我遇到过类似的问题,官方 Document 说下面一行直到 spark <2.4:
Convert time string to a Unix timestamp (in seconds) with a specified
format (see
[http://docs.oracle.com/javase/tutorial/i18n/format/simpleDateFormat.html])
to Unix timestamp (in seconds), return null if fail.
这意味着它只处理秒。
Spark>= 2.4 也可以处理 SSS
。
解决方案:下面的UDF将有助于处理这种情况:
import java.text.SimpleDateFormat
import java.sql.Timestamp
import org.apache.spark.sql.functions._
import scala.util.{Try, Success, Failure}
val getTimestampWithMilis: ((String , String) => Option[Timestamp]) = (input, frmt) => input match {
case "" => None
case _ => {
val format = new SimpleDateFormat(frmt)
Try(new Timestamp(format.parse(input).getTime)) match {
case Success(t) => Some(t)
case Failure(_) => None
}
}
}
val getTimestampWithMilisUDF = udf(getTimestampWithMilis)
举个例子:
val content = Seq(("2019", "09", "11","17","16","54","762000000")).toDF("year", "month", "day", "hour", "minute", "second", "nano")
val df = content.withColumn("event_time_utc", concat('year, 'month, 'day, 'hour, 'minute, 'second, substring('nano, 0, 3)))
df.show
+----+-----+---+----+------+------+---------+-----------------+
|year|month|day|hour|minute|second| nano| event_time_utc|
+----+-----+---+----+------+------+---------+-----------------+
|2019| 09| 11| 17| 16| 54|762000000|20190911171654762|
+----+-----+---+----+------+------+---------+-----------------+
df.withColumn("event_time_utc_millis", getTimestampWithMilisUDF($"event_time_utc", lit("yyyyMMddHHmmssSSS"))).show(1, false)
+----+-----+---+----+------+------+---------+-----------------+-----------------------+
|year|month|day|hour|minute|second|nano |event_time_utc |event_time_utc_millis |
+----+-----+---+----+------+------+---------+-----------------+-----------------------+
|2019|09 |11 |17 |16 |54 |762000000|20190911171654762|2019-09-11 17:16:54.762|
+----+-----+---+----+------+------+---------+-----------------+-----------------------+
root
|-- year: string (nullable = true)
|-- month: string (nullable = true)
|-- day: string (nullable = true)
|-- hour: string (nullable = true)
|-- minute: string (nullable = true)
|-- second: string (nullable = true)
|-- nano: string (nullable = true)
|-- event_time_utc: string (nullable = true)
|-- event_time_utc_millis: timestamp (nullable = true)
为什么是:
import spark.implicits._
val content = Seq(("2019", "09", "11","17","16","54","762000000")).toDF("year", "month", "day", "hour", "minute", "second", "nano")
content.printSchema
content.show
content.withColumn("event_time_utc", to_timestamp(concat('year, 'month, 'day, 'hour, 'minute, 'second), "yyyyMMddHHmmss"))
.withColumn("event_time_utc_millis", to_timestamp(concat('year, 'month, 'day, 'hour, 'minute, 'second, substring('nano, 0, 3)), "yyyyMMddHHmmssSSS"))
.select('year, 'month, 'day, 'hour, 'minute, 'second, 'nano,substring('nano, 0, 3), 'event_time_utc, 'event_time_utc_millis)
.show
缺少毫秒?
+----+-----+---+----+------+------+---------+---------------------+-------------------+---------------------+
|year|month|day|hour|minute|second| nano|substring(nano, 0, 3)| event_time_utc|event_time_utc_millis|
+----+-----+---+----+------+------+---------+---------------------+-------------------+---------------------+
|2019| 09| 11| 17| 16| 54|762000000| 762|2019-09-11 17:16:54| 2019-09-11 17:16:54|
+----+-----+---+----+------+------+---------+---------------------+-------------------+---------------------+
对于格式字符串:yyyyMMddHHmmssSSS
如果我没记错的话,它应该包括 SSS
中的毫秒数。
尝试在此标准中进行连接:"yyyy-MM-dd HH:mm:ss.ssss"(忽略零,例如:“762000000”,因为 nano/milliseconds 变为“762”)
youDataframe
.withColumn("dateTime_complete",
concat_ws(" ", concat_ws("-", col("year"), col("month"), col("day")),
concat_ws(":", col("hour"), col("minute"), concat_ws(".", col("second"), col("nano")))))
.withColumn("your_new_column", to_utc_timestamp(col("dateTime_complete"), "yyyy-MM-dd HH:mm:ss.sss"))
我遇到过类似的问题,官方 Document 说下面一行直到 spark <2.4:
Convert time string to a Unix timestamp (in seconds) with a specified format (see [http://docs.oracle.com/javase/tutorial/i18n/format/simpleDateFormat.html]) to Unix timestamp (in seconds), return null if fail.
这意味着它只处理秒。
Spark>= 2.4 也可以处理 SSS
。
解决方案:下面的UDF将有助于处理这种情况:
import java.text.SimpleDateFormat
import java.sql.Timestamp
import org.apache.spark.sql.functions._
import scala.util.{Try, Success, Failure}
val getTimestampWithMilis: ((String , String) => Option[Timestamp]) = (input, frmt) => input match {
case "" => None
case _ => {
val format = new SimpleDateFormat(frmt)
Try(new Timestamp(format.parse(input).getTime)) match {
case Success(t) => Some(t)
case Failure(_) => None
}
}
}
val getTimestampWithMilisUDF = udf(getTimestampWithMilis)
举个例子:
val content = Seq(("2019", "09", "11","17","16","54","762000000")).toDF("year", "month", "day", "hour", "minute", "second", "nano")
val df = content.withColumn("event_time_utc", concat('year, 'month, 'day, 'hour, 'minute, 'second, substring('nano, 0, 3)))
df.show
+----+-----+---+----+------+------+---------+-----------------+
|year|month|day|hour|minute|second| nano| event_time_utc|
+----+-----+---+----+------+------+---------+-----------------+
|2019| 09| 11| 17| 16| 54|762000000|20190911171654762|
+----+-----+---+----+------+------+---------+-----------------+
df.withColumn("event_time_utc_millis", getTimestampWithMilisUDF($"event_time_utc", lit("yyyyMMddHHmmssSSS"))).show(1, false)
+----+-----+---+----+------+------+---------+-----------------+-----------------------+
|year|month|day|hour|minute|second|nano |event_time_utc |event_time_utc_millis |
+----+-----+---+----+------+------+---------+-----------------+-----------------------+
|2019|09 |11 |17 |16 |54 |762000000|20190911171654762|2019-09-11 17:16:54.762|
+----+-----+---+----+------+------+---------+-----------------+-----------------------+
root
|-- year: string (nullable = true)
|-- month: string (nullable = true)
|-- day: string (nullable = true)
|-- hour: string (nullable = true)
|-- minute: string (nullable = true)
|-- second: string (nullable = true)
|-- nano: string (nullable = true)
|-- event_time_utc: string (nullable = true)
|-- event_time_utc_millis: timestamp (nullable = true)