Convert long type column to calendarinterval type in Spark SQL
Two questions:
How do I convert a long type column holding seconds into a calendarinterval type in Spark SQL?
How do I convert the following code into a pure Spark SQL query:
from pyspark.sql.functions import unix_timestamp

df2 = df.withColumn(
    "difference_duration",
    unix_timestamp("CAL_COMPLETION_TIME") - unix_timestamp("Prev_Time")
)
Sample dataframe screenshot:
Basically, I am trying to achieve the following PGSQL query in Spark SQL:
case
when t1.prev_time <> t1.prev_time_calc and t1."CAL_COMPLETION_TIME" - t1.prev_time < interval '30 min'
then t1.next_time_calc - t1.prev_time_calc
when (t1.next_time <> t1.next_time_calc and t1.next_time - t1."CAL_COMPLETION_TIME" < interval '30 min') or (t1.next_time - t1."CAL_COMPLETION_TIME" < interval '30 min')
then t1.next_time_calc - t1."CAL_COMPLETION_TIME"
else null
end min_diff
But the part t1."CAL_COMPLETION_TIME" - t1.prev_time < interval '30 min' throws the following error:
AnalysisException: "cannot resolve '(t1.`CAL_COMPLETION_TIME` - t1.`prev_time`)' due to data type mismatch: '(t1.`CAL_COMPLETION_TIME` - t1.`prev_time`)' requires (numeric or calendarinterval) type, not timestamp;
You cannot subtract timestamps directly; you need to convert them to seconds first. So what you are looking for is to cast the timestamp columns to long/bigint, subtract them, divide the result by 60 to get minutes, and then check whether that value is less than 30.
#example=df1
#both columns are of type Timestamp
+-------------------+-------------------+
| prev_time|CAL_COMPLETION_TIME|
+-------------------+-------------------+
|2019-04-26 01:19:10|2019-04-26 01:19:35|
+-------------------+-------------------+
PySpark:
from pyspark.sql import functions as F

df1.withColumn(
    "sub",
    F.when((F.col("CAL_COMPLETION_TIME").cast("long") - F.col("prev_time").cast("long")) / 60 < 30,
           F.lit("LESSTHAN30")).otherwise(F.lit("GREATERTHAN"))
).show()
+-------------------+-------------------+----------+
| prev_time|CAL_COMPLETION_TIME| sub|
+-------------------+-------------------+----------+
|2019-04-26 01:19:10|2019-04-26 01:19:35|LESSTHAN30|
+-------------------+-------------------+----------+
Spark SQL:
df1.createOrReplaceTempView("df1")
spark.sql("select prev_time, CAL_COMPLETION_TIME, IF(((CAST(CAL_COMPLETION_TIME as bigint) - CAST(prev_time as bigint))/60)<30,'LESSTHAN30','GREATER') as difference_duration from df1").show()
+-------------------+-------------------+-------------------+
| prev_time|CAL_COMPLETION_TIME|difference_duration|
+-------------------+-------------------+-------------------+
|2019-04-26 01:19:10|2019-04-26 01:19:35| LESSTHAN30|
+-------------------+-------------------+-------------------+
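For completeness, the full PGSQL CASE expression can be rewritten with the same cast-to-seconds trick. The sketch below is an assumption-based translation, not a drop-in answer: it presumes a temp view t1 whose prev_time, prev_time_calc, next_time, next_time_calc and CAL_COMPLETION_TIME columns are all timestamps, and it expresses the 30-minute threshold as 1800 seconds, so min_diff comes back as a bigint number of seconds rather than an interval.

-- Sketch only: the PGSQL CASE translated to Spark SQL via bigint (seconds) arithmetic.
-- The view name t1 and the column names are assumptions carried over from the question.
SELECT
  CASE
    WHEN prev_time <> prev_time_calc
         AND (CAST(CAL_COMPLETION_TIME AS bigint) - CAST(prev_time AS bigint)) < 1800
      THEN CAST(next_time_calc AS bigint) - CAST(prev_time_calc AS bigint)
    WHEN (next_time <> next_time_calc
          AND (CAST(next_time AS bigint) - CAST(CAL_COMPLETION_TIME AS bigint)) < 1800)
         OR (CAST(next_time AS bigint) - CAST(CAL_COMPLETION_TIME AS bigint)) < 1800
      THEN CAST(next_time_calc AS bigint) - CAST(CAL_COMPLETION_TIME AS bigint)
    ELSE NULL
  END AS min_diff
FROM t1

As for the title question about a literal calendarinterval: if you are on Spark 3.0 or later, the built-in make_interval function (for example make_interval(0, 0, 0, 0, 0, 0, seconds_col)) can turn a numeric seconds column into an interval, and on recent versions (3.2+, where timestamp subtraction yields a day-time interval) a comparison such as ts1 - ts2 < INTERVAL '30' MINUTE may work directly; on Spark 2.x the cast-to-seconds approach above is the usual workaround.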