PySpark add a column to a DataFrame from a TimestampType column
I have a DataFrame that looks like this. I want to operate on the day of the date_time field.
root
|-- host: string (nullable = true)
|-- user_id: string (nullable = true)
|-- date_time: timestamp (nullable = true)
I tried adding a column to extract the day. So far my attempts have failed.
df = df.withColumn("day", df.date_time.getField("day"))
org.apache.spark.sql.AnalysisException: GetField is not valid on fields of type TimestampType;
This also fails:
df = df.withColumn("day", df.select("date_time").map(lambda row: row.date_time.day))
AttributeError: 'PipelinedRDD' object has no attribute 'alias'
Any idea how to do this?
You can use a simple map (note that the field names and values have to be unpacked with * when building the new Row, otherwise the Row constructor receives a single list/tuple argument and fails):

from pyspark.sql import Row

mapped = df.rdd.map(lambda row:
    Row(*(row.__fields__ + ["day"]))(*(tuple(row) + (row.date_time.day,)))
)
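Since map returns an RDD, not a DataFrame, you have to rebuild the DataFrame afterwards. A minimal sketch, assuming the Spark 1.x sqlContext from the question is in scope and mapped is the RDD produced above:

# Rebuild a DataFrame from the mapped RDD; the schema is inferred from the Rows
df_with_day = sqlContext.createDataFrame(mapped)
df_with_day.printSchema()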
Another option is to register a function and run a SQL query:
sqlContext.registerFunction("day", lambda x: x.day)
sqlContext.registerDataFrameAsTable(df, "df")
sqlContext.sql("SELECT *, day(date_time) as day FROM df")
Finally, you can define a udf like this:
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType
day = udf(lambda date_time: date_time.day, IntegerType())
df.withColumn("day", day(df.date_time))
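One caveat worth noting (my addition, not from the original answer): date_time is nullable, and the lambda above raises an AttributeError when it receives None. A null-safe variant of the same udf:

from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

# Return None for NULL timestamps instead of failing on None.day
day = udf(lambda dt: dt.day if dt is not None else None, IntegerType())
df.withColumn("day", day(df.date_time))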
Edit:
Actually, if you use raw SQL, the day function is already defined (at least in Spark 1.4), so you can omit the udf registration. Spark also provides a number of other date processing functions, including (see the sketch after this list):
- extractors like year, month, and dayofmonth
- parsers like from_unixtime and formatters like date_format
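In Spark 1.5+ these are also exposed as DataFrame functions in pyspark.sql.functions, so the udf above becomes unnecessary. A sketch using the schema from the question:

from pyspark.sql.functions import dayofmonth, month, year

df.select(
    "*",
    year("date_time").alias("year"),
    month("date_time").alias("month"),
    dayofmonth("date_time").alias("day")
)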
It is also possible to use simple date expressions, for example:
current_timestamp() - expr("INTERVAL 1 HOUR")
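For example, keeping only the events from the last hour could look like this (a sketch; current_timestamp() is evaluated when the query runs):

from pyspark.sql.functions import col, current_timestamp, expr

one_hour_ago = current_timestamp() - expr("INTERVAL 1 HOUR")
df.where(col("date_time") > one_hour_ago)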
This means you can build relatively complex queries without passing data to Python. For example:
from pyspark.sql.functions import (
    col, datediff, expr, lit, next_day, unix_timestamp)

df = sc.parallelize([
(1, "2016-01-06 00:04:21"),
(2, "2016-05-01 12:20:00"),
(3, "2016-08-06 00:04:21")
]).toDF(["id", "ts_"])
now = lit("2016-06-01 00:00:00").cast("timestamp")
five_months_ago = now - expr("INTERVAL 5 MONTHS")
(df
# Cast string to timestamp
# For Spark 1.5 use cast("double").cast("timestamp")
.withColumn("ts", unix_timestamp("ts_").cast("timestamp"))
# Find all events in the last five months
.where(col("ts").between(five_months_ago, now))
# Find first Sunday after the event
.withColumn("next_sunday", next_day(col("ts"), "Sun"))
    # Compute the number of days from the event to that Sunday
    .withColumn("diff", datediff(col("next_sunday"), col("ts"))))
dayofmonth can also be applied directly to the column, as long as it is imported (an unqualified dayofmonth raises a NameError):

from pyspark.sql import functions as F

res = df.withColumn("dayofts", F.dayofmonth("ts_"))
res.show()