How to encode DateTime values in PySpark?

I have the following DataFrame in PySpark:

itemid  eventid    timestamp
134     30         2016-07-03 
134     32         2016-07-03 
125     32         2016-07-10

How can I encode timestamp into a random number or random string? For example:

itemid  eventid    timestamp   timestamp_enc
134     30         2016-07-03  1
134     32         2016-07-03  1
125     32         2016-07-10  2

The DataFrame:

from pyspark.sql.functions import col

df = (
    sc.parallelize([
        (134, 30, "2016-07-02"), (134, 32, "2016-07-02"),
        (125, 32, "2016-07-10"),
    ]).toDF(["itemid", "eventid", "timestamp"])
    .withColumn("timestamp", col("timestamp").cast("timestamp"))
)

Use the unix_timestamp function to create a "random" number: https://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.functions.unix_timestamp

from pyspark.sql.functions import col, unix_timestamp

df.withColumn("timestamp_enc", unix_timestamp(col("timestamp"), format='yyyy-MM-dd')).show()

Which gives:

+------+-------+-------------------+-------------+
|itemid|eventid|          timestamp|timestamp_enc|
+------+-------+-------------------+-------------+
|   134|     30|2016-07-02 00:00:00|   1467417600|
|   134|     32|2016-07-02 00:00:00|   1467417600|
|   125|     32|2016-07-10 00:00:00|   1468108800|
+------+-------+-------------------+-------------+