PySpark date_trunc without modifying the actual value
Consider the following DataFrame:
df:
time
2022-02-21T11:23:54
I need to convert it to:
time
2022-02-21T11:23:00
After running the following code:
df.withColumn("time_updated", date_trunc("minute", col("time"))).show(truncate = False)
My output is:
time
2022-02-21 11:23:00
The required output is:
time
2022-02-21T11:23:00
Is there a way to keep the value otherwise unchanged and only truncate the seconds?
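This is only a formatting issue. date_trunc returns a timestamp, and what show() prints is the string rendering of that timestamp, which uses a space instead of "T". To get the "T" separator, format the column as a string explicitly: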
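from pyspark.sql import functions as F

# Cast the string to a timestamp, then render it with a literal "T" and seconds fixed at 00.
# Use lowercase "yyyy" (calendar year); uppercase "YYYY" is the week-based year.
df = df.withColumn(
    "time_updated",
    F.date_format(F.col("time").cast("timestamp"), "yyyy-MM-dd'T'HH:mm:00"),
)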
df.show(truncate=False)
+-------------------+-------------------+
|time               |time_updated       |
+-------------------+-------------------+
|2022-02-21T11:23:54|2022-02-21T11:23:00|
+-------------------+-------------------+
df.printSchema()
root
|-- time: string (nullable = true)
|-- time_updated: string (nullable = true)
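The difference between the value and its string rendering can also be seen outside Spark. The sketch below is only an analogy using Python's standard datetime module, not Spark's implementation; the helper name truncate_to_minute is hypothetical. It truncates the seconds once, then renders the same value two ways: with a space and with "T".

```python
from datetime import datetime


def truncate_to_minute(value: str) -> datetime:
    """Parse an ISO 8601 string and zero out seconds and microseconds."""
    return datetime.fromisoformat(value).replace(second=0, microsecond=0)


truncated = truncate_to_minute("2022-02-21T11:23:54")

# One underlying value, two string renderings.
space_form = str(truncated)                          # space between date and time
iso_form = truncated.isoformat(timespec="seconds")   # "T" between date and time

print(space_form)
print(iso_form)
```

Only the rendering differs between the two strings; the truncated value itself is the same, which is why the fix in Spark is a formatting call rather than a different truncation.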