Spark: How to divide interval by interval
I have a dataframe with the following structure:
timeStatistics.show(10, False)
+------+---------------------------------------+---------------------------------------+--------------------------------------+-----+
|idByte|min(time_delta) |max(time_delta) |avg(time_delta) |count|
+------+---------------------------------------+---------------------------------------+--------------------------------------+-----+
|1002b0|INTERVAL '0 00:00:00.046' DAY TO SECOND|INTERVAL '0 00:00:00.054' DAY TO SECOND|INTERVAL '0 00:00:00.05' DAY TO SECOND|4198 |
|1002b1|INTERVAL '0 00:00:00.046' DAY TO SECOND|INTERVAL '0 00:00:00.054' DAY TO SECOND|INTERVAL '0 00:00:00.05' DAY TO SECOND|4198 |
|1002b2|INTERVAL '0 00:00:00.046' DAY TO SECOND|INTERVAL '0 00:00:00.054' DAY TO SECOND|INTERVAL '0 00:00:00.05' DAY TO SECOND|4198 |
|1002b3|INTERVAL '0 00:00:00.046' DAY TO SECOND|INTERVAL '0 00:00:00.054' DAY TO SECOND|INTERVAL '0 00:00:00.05' DAY TO SECOND|4198 |
|1002b4|INTERVAL '0 00:00:00.046' DAY TO SECOND|INTERVAL '0 00:00:00.054' DAY TO SECOND|INTERVAL '0 00:00:00.05' DAY TO SECOND|4198 |
|1002b5|INTERVAL '0 00:00:00.046' DAY TO SECOND|INTERVAL '0 00:00:00.054' DAY TO SECOND|INTERVAL '0 00:00:00.05' DAY TO SECOND|4198 |
|1002b6|INTERVAL '0 00:00:00.046' DAY TO SECOND|INTERVAL '0 00:00:00.054' DAY TO SECOND|INTERVAL '0 00:00:00.05' DAY TO SECOND|4198 |
|1002b7|INTERVAL '0 00:00:00.046' DAY TO SECOND|INTERVAL '0 00:00:00.054' DAY TO SECOND|INTERVAL '0 00:00:00.05' DAY TO SECOND|4198 |
|1004b0|INTERVAL '0 00:00:00.046' DAY TO SECOND|INTERVAL '0 00:00:00.054' DAY TO SECOND|INTERVAL '0 00:00:00.05' DAY TO SECOND|4198 |
|1004b1|INTERVAL '0 00:00:00.046' DAY TO SECOND|INTERVAL '0 00:00:00.054' DAY TO SECOND|INTERVAL '0 00:00:00.05' DAY TO SECOND|4198 |
+------+---------------------------------------+---------------------------------------+--------------------------------------+-----+
only showing top 10 rows
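For reference, a frame with this structure can be built from scratch (a minimal sketch, assuming Spark 3.2+, where subtracting two timestamps yields a day-time interval; the idByte value and timestamps are made up for illustration):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Subtracting two timestamp columns yields a DayTimeIntervalType
# ("interval day to second") column in Spark 3.2+.
df = spark.createDataFrame(
    [("1002b0", "2022-01-01 00:00:00.000", "2022-01-01 00:00:00.046", "2022-01-01 00:00:00.054")],
    ["idByte", "t0", "t_min", "t_max"],
).select(
    "idByte",
    (F.col("t_min").cast("timestamp") - F.col("t0").cast("timestamp")).alias("min(time_delta)"),
    (F.col("t_max").cast("timestamp") - F.col("t0").cast("timestamp")).alias("max(time_delta)"),
)
df.printSchema()
# root
#  |-- idByte: string (nullable = true)
#  |-- min(time_delta): interval day to second (nullable = true)
#  |-- max(time_delta): interval day to second (nullable = true)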
I want to add a column that gives the factor by which min(time_delta) and max(time_delta) differ. My first attempt was to add:
.withColumn("min_max_split", (F.col("max(time_delta)")/F.col("min(time_delta)")))
However, dividing one interval by another does not seem to be supported:
AnalysisException: cannot resolve '(max(time_delta) / min(time_delta))' due to data type mismatch: argument 2 requires numeric type, however, 'min(time_delta)' is of interval day to second type.
I thought about converting the intervals with the unix_timestamp() function. However, my intervals are sometimes smaller than one second, so unix_timestamp() would return zero.
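To illustrate the truncation (a hypothetical one-liner, assuming a UTC session timezone): unix_timestamp() returns whole seconds as a long, so anything below one second is lost:

spark.sql("SELECT unix_timestamp(timestamp '1970-01-01 00:00:00.046') AS secs").show()
# +----+
# |secs|
# +----+
# |   0|
# +----+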
You can add the interval to current_timestamp, cast the result to double, and then divide:
from pyspark.sql import functions as F
df1 = df.withColumn(
"min_max_split",
(F.current_timestamp() + F.col("max(time_delta)")).cast('double') / (
F.current_timestamp() + F.col("min(time_delta)")).cast('double')
)
df1.show(1)
#+------+--------------------+--------------------+--------------------+-----+------------------+
#|idByte| min(time_delta)| max(time_delta)| avg(time_delta)|count| min_max_split|
#+------+--------------------+--------------------+--------------------+-----+------------------+
#|1002b0|INTERVAL '0 00:00...|INTERVAL '0 00:00...|INTERVAL '0 00:00...| 4198|1.0000000000048699|
#+------+--------------------+--------------------+--------------------+-----+------------------+
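Note that with current_timestamp() as the base, both numerator and denominator are dominated by the large number of seconds since the epoch, which is why the quotient lands very close to 1 (the 1.0000000000048699 above) rather than at max/min. Anchoring the intervals at the Unix epoch instead makes the cast-to-double equal to the interval length in seconds, which is what the modification below does.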
I found a solution that is a slight modification of @blackbishop's answer:
.withColumn(
    "min_max_split",
    (F.to_timestamp(F.from_unixtime(F.lit(0))) + F.col("min(time_delta)")).cast('double')
    / (F.to_timestamp(F.from_unixtime(F.lit(0))) + F.col("max(time_delta)")).cast('double')
)
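As a sanity check (a sketch reusing the names and values from the frame above): with the epoch as the base, cast('double') yields the interval length in seconds, so the expected ratio is 0.046 / 0.054 ≈ 0.85185:

# 1970-01-01 00:00:00, i.e. epoch second 0, so cast('double') of
# (epoch + interval) is just the interval length in seconds.
epoch = F.to_timestamp(F.from_unixtime(F.lit(0)))

check = timeStatistics.withColumn(
    "min_max_split",
    (epoch + F.col("min(time_delta)")).cast("double")
    / (epoch + F.col("max(time_delta)")).cast("double"),
)
check.select("idByte", "min_max_split").show(1)
# expected: 0.046 / 0.054 = 0.8518518...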