字符串到时间戳 Pyspark / Spark
String to Timestamp Pyspark / Spark
我需要将输入字符串 08-DEC-2011 00.00.00 转换为时间戳 20111208000000 - 文件中的数据。
"CLIENTCONTEXTID","SRVR","CLNT","USERNAME","UPDATEDTM"
1202,"jbosswabcd6","100.126.164.172","SUSER",08-DEC-2011 00.00.00
1653,"jbossbabcd4","17.26.164.176","SUSER",08-DEC-2011 00.00.00
1654,"jbossbabcd4","12.26.164.221","SUSER",08-JAN-2011 00.00.00
1655,"jbossbabcd4","17.26.164.223","SuperUser",08-DEC-2011 00.00.00
1656,"jbossbabcd4","17.26.164.221","SUSER",08-DEC-2011 00.00.00
1657,"jbossbabcd4","17.26.164.226","SUSER",08-DEC-2011 00.00.00
1658,"jbossbabcd4","100.26.164.221","SUSER",08-DEC-2011 00.00.00
1659,"jbossbabcd4","100.26.164.221","SUSER",08-DEC-2011 00.00.00
2802,"jbosswabcd1","172.20.19.130","SuperUser",08-DEC-2011 00.00.00
作为第一步,我尝试将其转换为时间戳
from pyspark.sql import functions as F
from pyspark.sql.functions import unix_timestamp
df = spark.read.csv('/hdfs/context.csv',header=True)
df.printSchema()
df.createOrReplaceTempView("tablesss")
spark.sql('select UPDATEDTM,cast(unix_timestamp(UPDATEDTM,"dd-MMM-YYYY HH.mm.ss") as timestamp) columnn12 from tablesss').show()
我也尝试了另一种方式
df = spark.read.csv('/hdfs/context.csv',header=True)
df.printSchema()
df=df.withColumn("datetype_timestamp",F.to_timestamp(F.col("UPDATEDTM"),"dd-MMM-YYYY HH.mm.ss"))
df.select(F.col("UPDATEDTM"),F.col("datetype_timestamp")).show()
两者的结果如下
+--------------------+-------------------+
| UPDATEDTM| datetype_timestamp|
+--------------------+-------------------+
|08-DEC-2011 00.00.00|2010-12-26 00:00:00|
|08-DEC-2011 00.00.00|2010-12-26 00:00:00|
|08-DEC-2011 00.00.00|2010-12-26 00:00:00|
|08-DEC-2011 00.00.00|2010-12-26 00:00:00|
|08-DEC-2011 00.00.00|2010-12-26 00:00:00|
|08-DEC-2011 00.00.00|2010-12-26 00:00:00|
|08-DEC-2011 00.00.00|2010-12-26 00:00:00|
|08-DEC-2011 00.00.00|2010-12-26 00:00:00|
但是,如果我使用 scala
val df = spark.read.option("header",true).csv("/hdfs/context.csv")
df.withColumn("datetype_timestamp", to_timestamp(col("UPDATEDTM"),"dd-MMM-yyyy HH.mm.ss")).show(false)
得到的结果是
+--------------------+-------------------+
|UPDATEDTM |datetype_timestamp |
+--------------------+-------------------+
|08-DEC-2011 00.00.00|2011-12-08 00:00:00|
|08-DEC-2011 00.00.00|2011-12-08 00:00:00|
|08-DEC-2011 00.00.00|2011-12-08 00:00:00|
|08-DEC-2011 00.00.00|2011-12-08 00:00:00|
|08-DEC-2011 00.00.00|2011-12-08 00:00:00|
|08-DEC-2011 00.00.00|2011-12-08 00:00:00|
|08-DEC-2011 00.00.00|2011-12-08 00:00:00|
|08-DEC-2011 00.00.00|2011-12-08 00:00:00
不确定 Scala 和 Pyspark 之间的区别。同样的反应很奇怪。
在这个时间戳之后,我想到应用 Python strftime 来转换为所需的格式。
请帮助确定此转换的问题和最佳方式
使用yyyy
(普通日历年)而不是YYYY
(基于周的日历)。
在您使用的 scala 代码中 yyyy
和在 pyspark 中 YYYY
.
Example:
df.show()
#+--------------------+
#| UPDATEDTM|
#+--------------------+
#|08-DEC-2011 00.00.00|
#+--------------------+
df.withColumn("dd",to_timestamp(col("UPDATEDTM"),'dd-MMM-yyyy HH.mm.ss')).show()
#using unix_timestamp function
df.withColumn("dd",unix_timestamp(col("UPDATEDTM"),'dd-MMM-yyyy HH.mm.ss').cast("timestamp")).show()
#+--------------------+-------------------+
#| UPDATEDTM| dd|
#+--------------------+-------------------+
#|08-DEC-2011 00.00.00|2011-12-08 00:00:00|
#+--------------------+-------------------+
我需要将输入字符串 08-DEC-2011 00.00.00 转换为时间戳 20111208000000 - 文件中的数据。
"CLIENTCONTEXTID","SRVR","CLNT","USERNAME","UPDATEDTM"
1202,"jbosswabcd6","100.126.164.172","SUSER",08-DEC-2011 00.00.00
1653,"jbossbabcd4","17.26.164.176","SUSER",08-DEC-2011 00.00.00
1654,"jbossbabcd4","12.26.164.221","SUSER",08-JAN-2011 00.00.00
1655,"jbossbabcd4","17.26.164.223","SuperUser",08-DEC-2011 00.00.00
1656,"jbossbabcd4","17.26.164.221","SUSER",08-DEC-2011 00.00.00
1657,"jbossbabcd4","17.26.164.226","SUSER",08-DEC-2011 00.00.00
1658,"jbossbabcd4","100.26.164.221","SUSER",08-DEC-2011 00.00.00
1659,"jbossbabcd4","100.26.164.221","SUSER",08-DEC-2011 00.00.00
2802,"jbosswabcd1","172.20.19.130","SuperUser",08-DEC-2011 00.00.00
作为第一步,我尝试将其转换为时间戳
from pyspark.sql import functions as F
from pyspark.sql.functions import unix_timestamp
df = spark.read.csv('/hdfs/context.csv',header=True)
df.printSchema()
df.createOrReplaceTempView("tablesss")
spark.sql('select UPDATEDTM,cast(unix_timestamp(UPDATEDTM,"dd-MMM-YYYY HH.mm.ss") as timestamp) columnn12 from tablesss').show()
我也尝试了另一种方式
df = spark.read.csv('/hdfs/context.csv',header=True)
df.printSchema()
df=df.withColumn("datetype_timestamp",F.to_timestamp(F.col("UPDATEDTM"),"dd-MMM-YYYY HH.mm.ss"))
df.select(F.col("UPDATEDTM"),F.col("datetype_timestamp")).show()
两者的结果如下
+--------------------+-------------------+
| UPDATEDTM| datetype_timestamp|
+--------------------+-------------------+
|08-DEC-2011 00.00.00|2010-12-26 00:00:00|
|08-DEC-2011 00.00.00|2010-12-26 00:00:00|
|08-DEC-2011 00.00.00|2010-12-26 00:00:00|
|08-DEC-2011 00.00.00|2010-12-26 00:00:00|
|08-DEC-2011 00.00.00|2010-12-26 00:00:00|
|08-DEC-2011 00.00.00|2010-12-26 00:00:00|
|08-DEC-2011 00.00.00|2010-12-26 00:00:00|
|08-DEC-2011 00.00.00|2010-12-26 00:00:00|
但是,如果我使用 scala
val df = spark.read.option("header",true).csv("/hdfs/context.csv")
df.withColumn("datetype_timestamp", to_timestamp(col("UPDATEDTM"),"dd-MMM-yyyy HH.mm.ss")).show(false)
得到的结果是
+--------------------+-------------------+
|UPDATEDTM |datetype_timestamp |
+--------------------+-------------------+
|08-DEC-2011 00.00.00|2011-12-08 00:00:00|
|08-DEC-2011 00.00.00|2011-12-08 00:00:00|
|08-DEC-2011 00.00.00|2011-12-08 00:00:00|
|08-DEC-2011 00.00.00|2011-12-08 00:00:00|
|08-DEC-2011 00.00.00|2011-12-08 00:00:00|
|08-DEC-2011 00.00.00|2011-12-08 00:00:00|
|08-DEC-2011 00.00.00|2011-12-08 00:00:00|
|08-DEC-2011 00.00.00|2011-12-08 00:00:00
不确定 Scala 和 Pyspark 之间的区别。同样的反应很奇怪。
在这个时间戳之后,我想到应用 Python strftime 来转换为所需的格式。
请帮助确定此转换的问题和最佳方式
使用yyyy
(普通日历年)而不是YYYY
(基于周的日历)。
在您使用的 scala 代码中 yyyy
和在 pyspark 中 YYYY
.
Example:
df.show()
#+--------------------+
#| UPDATEDTM|
#+--------------------+
#|08-DEC-2011 00.00.00|
#+--------------------+
df.withColumn("dd",to_timestamp(col("UPDATEDTM"),'dd-MMM-yyyy HH.mm.ss')).show()
#using unix_timestamp function
df.withColumn("dd",unix_timestamp(col("UPDATEDTM"),'dd-MMM-yyyy HH.mm.ss').cast("timestamp")).show()
#+--------------------+-------------------+
#| UPDATEDTM| dd|
#+--------------------+-------------------+
#|08-DEC-2011 00.00.00|2011-12-08 00:00:00|
#+--------------------+-------------------+