How to encode DateTime values in PySpark?
I have the following DataFrame in PySpark:
itemid eventid timestamp
134 30 2016-07-03
134 32 2016-07-03
125 32 2016-07-10
How can I encode the timestamp column as a random number or a random string? For example:
itemid eventid timestamp timestamp_enc
134 30 2016-07-03 1
134 32 2016-07-03 1
125 32 2016-07-10 2
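Given the DataFrame:
from pyspark.sql.functions import col

# Assumes an active SparkContext named sc, as in the PySpark shell
df = (
    sc.parallelize([
        (134, 30, "2016-07-02"), (134, 32, "2016-07-02"),
        (125, 32, "2016-07-10"),
    ]).toDF(["itemid", "eventid", "timestamp"])
    # Cast the date strings to Spark's timestamp type
    .withColumn("timestamp", col("timestamp").cast("timestamp"))
)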
Use the unix_timestamp function to create a "random" number:
https://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.functions.unix_timestamp
from pyspark.sql.functions import col, unix_timestamp

# Assumes an active SparkContext named sc, as in the PySpark shell
df = (
    sc.parallelize([
        (134, 30, "2016-07-02"), (134, 32, "2016-07-02"),
        (125, 32, "2016-07-10"),
    ]).toDF(["itemid", "eventid", "timestamp"])
    .withColumn("timestamp", col("timestamp").cast("timestamp"))
)

# unix_timestamp returns the number of seconds since the Unix epoch
df.withColumn("timestamp_enc", unix_timestamp(col("timestamp"), format='yyyy-MM-dd')).show()
This produces:
+------+-------+-------------------+-------------+
|itemid|eventid|          timestamp|timestamp_enc|
+------+-------+-------------------+-------------+
|   134|     30|2016-07-02 00:00:00|   1467417600|
|   134|     32|2016-07-02 00:00:00|   1467417600|
|   125|     32|2016-07-10 00:00:00|   1468108800|
+------+-------+-------------------+-------------+
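Note that these values are not really random: each one is the number of seconds since 1970-01-01 00:00:00 UTC, so equal timestamps always get the same code. The numbers shown above match midnight UTC; with a different Spark session time zone they would likely be shifted. The plain-Python sketch below (no Spark needed) reproduces the arithmetic and then derives compact labels like the 1, 1, 2 in the question. The helper name epoch_seconds and the variable names are made up for illustration and are not part of the PySpark API.

```python
from datetime import datetime, timezone

def epoch_seconds(date_str, fmt="%Y-%m-%d"):
    """Seconds since 1970-01-01 00:00:00 UTC, reading date_str as midnight UTC."""
    parsed = datetime.strptime(date_str, fmt).replace(tzinfo=timezone.utc)
    return int(parsed.timestamp())

dates = ["2016-07-02", "2016-07-02", "2016-07-10"]
encoded = [epoch_seconds(d) for d in dates]
print(encoded)  # [1467417600, 1467417600, 1468108800]

# Compact labels: number the distinct values in sorted order, starting at 1
labels = {value: i for i, value in enumerate(sorted(set(encoded)), start=1)}
timestamp_enc = [labels[value] for value in encoded]
print(timestamp_enc)  # [1, 1, 2]
```

In Spark itself, a similar compact numbering could presumably be obtained with the dense_rank window function ordered by timestamp, at the cost of a global sort.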