How to expand out a Pyspark dataframe based on column?
How can I expand out a dataframe based on column values? I want to start with this dataframe:
+---------+----------+----------+
|DEVICE_ID| MIN_DATE| MAX_DATE|
+---------+----------+----------+
| 1|2019-08-29|2019-08-31|
| 2|2019-08-27|2019-09-02|
+---------+----------+----------+
and end up with one that looks like this:
+---------+----------+
|DEVICE_ID| DATE|
+---------+----------+
| 1|2019-08-29|
| 1|2019-08-30|
| 1|2019-08-31|
| 2|2019-08-27|
| 2|2019-08-28|
| 2|2019-08-29|
| 2|2019-08-30|
| 2|2019-08-31|
| 2|2019-09-01|
| 2|2019-09-02|
+---------+----------+
Any help would be greatly appreciated.
from datetime import timedelta
from pyspark.sql import functions as F
from pyspark.sql.functions import udf

# Create a sample data row.
df = sqlContext.sql("""
    select 'dev1' as device_id,
           to_date('2020-01-06') as start,
           to_date('2020-01-09') as end""")

# Define a UDF that returns a comma-separated string of every date
# between start and end (inclusive).
@udf
def datelist(start, end):
    return ",".join(str(start + timedelta(days=x))
                    for x in range(0, 1 + (end - start).days))

# Split the string and explode the resulting array into one row per date.
df.select("device_id",
          F.explode(
              F.split(datelist(df["start"], df["end"]), ","))
           .alias("date")).show(10, False)