Implementation of lag function by updating the same column
I want to update the lag value of barcode (offset = 1) back into barcode, using a case expression like this:
case
when ( lag(barcode,1) over (order by barcode )
and Datediff(SS, eventdate,lag(next_eventdate,1) over (order by barcode)) < 3*3600 )
THEN 1
ELSE 0
END as FLAG
I have implemented it in pyspark, but it gives me an error:
from pyspark.sql import functions as F
from pyspark.sql.functions import col, unix_timestamp
from pyspark.sql.window import Window

timeDiff = unix_timestamp('eventdate', format="ss") - unix_timestamp(F.lag('next_eventdate', 1), format="ss")
ww = Window.orderBy("barcode")
Tgt_df_tos = Tgt_df_7.withColumn('FLAG', F.when((F.lag('barcode', 1)) & (timeDiff <= 10800), "1").otherwise('0'))
I get the following error:
AnalysisException: "cannot resolve '(lag(`barcode`, 1, NULL) AND ((unix_timestamp(`eventdate`, 'ss') - unix_timestamp(lag(`next_eventdate`, 1, NULL), 'ss')) <= CAST(10800 AS BIGINT)))' due to data type mismatch: differing types in '(lag(`barcode`, 1, NULL) AND ((unix_timestamp(`eventdate`, 'ss') - unix_timestamp(lag(`next_eventdate`, 1, NULL), 'ss')) <= CAST(10800 AS BIGINT)))' (int and boolean).
I am not familiar with pyspark, but it looks to me like the problem is in the CASE statement:
CASE WHEN (
LAG(barcode,1) OVER (ORDER BY barcode )
AND
DATEDIFF(SS, eventdate, LAG(next_eventdate, 1) OVER(ORDER BY barcode)) < 3*3600
)
There are two expressions here:
"LAG(barcode,1) OVER (ORDER BY barcode)" evaluates to an INTEGER.
"DATEDIFF(SS, eventdate, LAG(next_eventdate, 1) OVER(ORDER BY barcode)) < 3*3600" evaluates to a BOOLEAN (because of the inequality).
These expressions are combined with the AND operator, which is normally used to combine two boolean expressions. I believe that is the cause of the error:
LAG(barcode,1) OVER (ORDER BY barcode) evaluates to an integer, not a boolean.
So the expression ends up looking something like this:
CASE WHEN (324857 AND True) THEN 1 ELSE 0 END as FLAG
AnalysisException: "cannot resolve .... (int and boolean).