Create a new row for each minute of difference in Spark SQL
Consider my data:
+---+-------------------+-------------------+
| id| starttime| endtime|
+---+-------------------+-------------------+
| 1|1970-01-01 07:00:00|1970-01-01 07:03:00|
+---+-------------------+-------------------+
Based on this, I want a SQL query that creates one row for each minute of difference between starttime and endtime, ending exactly at my data's endtime, like this:
+---+-------------------+-------------------+
| id|          starttime|            endtime|
+---+-------------------+-------------------+
|  1|1970-01-01 07:00:00|1970-01-01 07:03:00|
|  1|1970-01-01 07:01:00|1970-01-01 07:03:00|
|  1|1970-01-01 07:02:00|1970-01-01 07:03:00|
|  1|1970-01-01 07:03:00|1970-01-01 07:03:00|
+---+-------------------+-------------------+
I'd prefer SQL, but if that's not possible, PySpark is fine too.
Try this:
import pyspark.sql.functions as f
df.show()
+---+-------------------+-------------------+
| id| starttime| endtime|
+---+-------------------+-------------------+
| 1|1970-01-01 07:00:00|1970-01-01 07:03:00|
+---+-------------------+-------------------+
#df.printSchema()
# root
# |-- id: long (nullable = true)
# |-- starttime: timestamp (nullable = true)
# |-- endtime: timestamp (nullable = true)
A combination of expr and sequence with an interval of one minute gives you an array of minute timestamps; explode then turns that array into rows.
df.select(
    'id',
    f.explode(f.expr('sequence(starttime, endtime, interval 1 minute)')).alias('starttime'),
    'endtime'
).show(truncate=False)
+---+-------------------+-------------------+
|id |starttime |endtime |
+---+-------------------+-------------------+
|1 |1970-01-01 07:00:00|1970-01-01 07:03:00|
|1 |1970-01-01 07:01:00|1970-01-01 07:03:00|
|1 |1970-01-01 07:02:00|1970-01-01 07:03:00|
|1 |1970-01-01 07:03:00|1970-01-01 07:03:00|
+---+-------------------+-------------------+
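If it helps to see what sequence(starttime, endtime, interval 1 minute) is doing before explode gets involved, here is a minimal pure-Python sketch of the same inclusive one-minute stepping logic. The helper name minute_sequence is just for illustration, not a Spark API:

```python
from datetime import datetime, timedelta

def minute_sequence(start, end):
    """Inclusive list of timestamps from start to end, one minute apart
    (mirrors what Spark's sequence() builds for each row)."""
    out = []
    t = start
    while t <= end:
        out.append(t)
        t += timedelta(minutes=1)
    return out

start = datetime(1970, 1, 1, 7, 0, 0)
end = datetime(1970, 1, 1, 7, 3, 0)
for ts in minute_sequence(start, end):
    print(ts)  # 07:00, 07:01, 07:02, 07:03 -> four rows after explode
```

Each element of that array becomes one output row, which is why the result above has four rows for a three-minute difference.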
For Spark 2.4+, you can use the sequence function to generate an array of timestamps over the range, then explode it:
SELECT id,
explode(sequence(to_timestamp(starttime), to_timestamp(endtime), interval 1 minute)) AS starttime,
endtime
FROM my_table
df = spark.createDataFrame([(1, "1970-01-01 07:00:00", "1970-01-01 07:03:00")], ["id", "starttime", "endtime"])
df.createOrReplaceTempView("my_table")
sql_query = """SELECT id,
explode(sequence(to_timestamp(starttime), to_timestamp(endtime), interval 1 minute)) as starttime,
endtime
FROM my_table
"""
spark.sql(sql_query).show()
#+---+-------------------+-------------------+
#| id| starttime| endtime|
#+---+-------------------+-------------------+
#| 1|1970-01-01 07:00:00|1970-01-01 07:03:00|
#| 1|1970-01-01 07:01:00|1970-01-01 07:03:00|
#| 1|1970-01-01 07:02:00|1970-01-01 07:03:00|
#| 1|1970-01-01 07:03:00|1970-01-01 07:03:00|
#+---+-------------------+-------------------+