Getting the current timestamp of each row of a DataFrame using Spark / Java

Question

I want to get the current timestamp for each row.

I am using the code below:

dataframe.withColumn("current_date",current_timestamp());

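However, current_timestamp() is evaluated before serialization, so I always get the same date for every row.

How can current_timestamp() be evaluated separately for each row of the DataFrame?

I need your help.

Thanks.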

Answer 1

Even a plain Python expression is treated as a constant at serialization time, so the code below also produces the same time value for every row:

dataframe.withColumn("current_date", F.lit( time.time()))

However, wrapping the time call in a UDF makes the value resolve at run time, as shown below:

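import time

from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

def get_time():
    # Called on the executors for each row, so the clock is read at run time
    return time.time()

# Declare the return type (the default is a string) and mark the UDF as
# nondeterministic so the optimizer does not assume repeated calls are equal
time_udf = udf(get_time, DoubleType()).asNondeterministic()

dataframe.withColumn("current_date", time_udf())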
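Since the question is about Java, here is a minimal sketch of the same distinction in plain Java, with no Spark involved. The class and method names are hypothetical and only illustrate the idea: a value captured once behaves like a literal, while a function called per row behaves like the UDF above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.LongSupplier;

// Toy illustration only: no Spark here, and all names are made up for this sketch.
class PerRowClockDemo {

    // Mimics a literal column: the clock is read once and the value is copied to every row.
    static List<Long> stampWithConstant(int rows, LongSupplier clock) {
        long fixed = clock.getAsLong();
        List<Long> out = new ArrayList<>();
        for (int i = 0; i < rows; i++) {
            out.add(fixed);
        }
        return out;
    }

    // Mimics a UDF column: the clock is read again for each row.
    static List<Long> stampPerRow(int rows, LongSupplier clock) {
        List<Long> out = new ArrayList<>();
        for (int i = 0; i < rows; i++) {
            out.add(clock.getAsLong());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(stampWithConstant(3, System::nanoTime));
        System.out.println(stampPerRow(3, System::nanoTime));
    }
}
```

The first list always holds three identical values. The second may differ from row to row, although on a fast machine neighbouring rows can still land on the same clock reading.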

Hope this helps!

Answer 2

Try this:


    df2.withColumn("current_date", expr("reflect('java.lang.System', 'currentTimeMillis')"))
      .show(false)

    /**
      * +-----+------+-------------+
      * |class|gender|current_date |
      * +-----+------+-------------+
      * |1    |m     |1594137247247|
      * |1    |m     |1594137247247|
      * |1    |f     |1594137247247|
      * |2    |f     |1594137247272|
      * |2    |f     |1594137247272|
      * |3    |m     |1594137247272|
      * |3    |m     |1594137247272|
      * +-----+------+-------------+
      */

    df2.withColumn("current_date", expr("reflect('java.time.LocalDateTime', 'now')"))
      .show(false)

    /**
      * +-----+------+-----------------------+
      * |class|gender|current_date           |
      * +-----+------+-----------------------+
      * |1    |m     |2020-07-07T21:24:07.377|
      * |1    |m     |2020-07-07T21:24:07.378|
      * |1    |f     |2020-07-07T21:24:07.378|
      * |2    |f     |2020-07-07T21:24:07.398|
      * |2    |f     |2020-07-07T21:24:07.398|
      * |3    |m     |2020-07-07T21:24:07.398|
      * |3    |m     |2020-07-07T21:24:07.398|
      * +-----+------+-----------------------+
      */
// you can convert current_date to timestamp by casting it to "timestamp"
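To make the two column formats above concrete, here is a small plain-Java sketch (hypothetical class and method names, no Spark) that decodes one value of each kind taken from the sample output. The first is epoch milliseconds as returned by System.currentTimeMillis(), the second is the ISO-8601 string that LocalDateTime.now() renders to.

```java
import java.time.Instant;
import java.time.LocalDateTime;

// Illustration only: the names are made up for this sketch.
class TimestampValueDemo {

    // Interprets a value such as 1594137247247 as milliseconds since the Unix epoch (UTC).
    static Instant fromEpochMillis(long millis) {
        return Instant.ofEpochMilli(millis);
    }

    // Parses a value such as 2020-07-07T21:24:07.377, which carries no time zone.
    static LocalDateTime fromIsoString(String text) {
        return LocalDateTime.parse(text);
    }

    public static void main(String[] args) {
        System.out.println(fromEpochMillis(1594137247247L));
        System.out.println(fromIsoString("2020-07-07T21:24:07.377"));
    }
}
```

One thing to watch for when casting in Spark: as far as I know, a numeric-to-timestamp cast reads the number as seconds rather than milliseconds, so the currentTimeMillis column would probably need dividing by 1000 first. The string column is the more straightforward one to cast.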
