Pivoting with missing values
I have a DataFrame with the following simple schema:
root
|-- amount: double (nullable = true)
|-- Date: timestamp (nullable = true)
I am trying to view the sum of amount per day and per hour, like this:
+---+--------+--------+ ... +--------+
|day| 0| 1| | 23|
+---+--------+--------+ ... +--------+
|148| 306.0| 106.0| | 0.0|
|243| 1906.0| 50.0| | 1.0|
| 31| 866.0| 100.0| | 0.0|
+---+--------+--------+ ... +--------+
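First I add an hour column, then I group by day and pivot by hour. However, I get an exception, which may be related to some hours having no sales. That is what I am trying to fix, but I have not figured out how.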
(df.withColumn("hour", hour("date"))
.groupBy(dayofyear("date").alias("day"))
.pivot("hour")
.sum("amount").show())
Exception excerpt:
AnalysisException: u'resolved attribute(s) date#3972 missing from
day#5367,hour#5354,sum(amount)#5437 in operator !Aggregate
[dayofyear(cast(date#3972 as date))], [dayofyear(cast(date#3972 as
date)) AS day#5367, pivotfirst(hour#5354, sum(amount)#5437, 0, 1, 2,
3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21,
22, 23, 0, 0) AS __pivot_sum(amount) AS sum(amount)#5487];'
The problem is the unresolved day column. You can create it outside the groupBy clause to fix this:
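from pyspark.sql.functions import col, dayofyear, hour
# sc is an existing SparkContext, as in the pyspark shell.
df = (sc
    .parallelize([
        (1.0, "2016-03-30 01:00:00"), (30.2, "2015-01-02 03:00:02")])
    .toDF(["amount", "Date"])
    .withColumn("Date", col("Date").cast("timestamp"))
    .withColumn("hour", hour("Date")))
# Materialize day as a real column before grouping on it.
with_day = df.withColumn("day", dayofyear("Date"))
# Listing the pivot values fixes the output columns to hours 0..23.
with_day.groupBy("day").pivot("hour", list(range(24))).sum("amount")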
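The values argument of pivot is optional but recommended. Without it Spark first has to compute the distinct values of the pivot column, and hours that never occur in the data get no column at all.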
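To make the effect of an explicit values list concrete, here is a plain-Python sketch of what the grouped pivot computes. It is only an illustration of the semantics, not how Spark implements it; pivot_sum and fill_missing are hypothetical helper names, and the rows are the two sample records from the answer.

```python
from collections import defaultdict
from datetime import datetime


def pivot_sum(rows, hours=range(24)):
    """Sum amount per (day of year, hour), one cell per listed hour."""
    hours = list(hours)
    sums = defaultdict(dict)  # day -> {hour: running sum}
    for amount, ts in rows:
        t = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
        day = t.timetuple().tm_yday   # same idea as dayofyear()
        cells = sums[day]             # the day gets a row even if no hour matches
        if t.hour in hours:           # hours outside the list are ignored
            cells[t.hour] = cells.get(t.hour, 0.0) + amount
    # Hours with no data stay None, which corresponds to null in Spark.
    return {day: [cells.get(h) for h in hours] for day, cells in sums.items()}


def fill_missing(table, default=0.0):
    """Replace the empty cells with a default value."""
    return {day: [default if v is None else v for v in row]
            for day, row in table.items()}


rows = [(1.0, "2016-03-30 01:00:00"), (30.2, "2015-01-02 03:00:02")]
table = fill_missing(pivot_sum(rows))
```

Every day ends up with exactly 24 cells because the hour list is fixed up front. On the Spark side the pivoted result holds null for hours without data; if zeros are wanted, as in the table in the question, appending .na.fill(0) to the pivoted DataFrame is the usual way to get them.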