How to get, for each row, the count of rows having the value 1 in a column during the preceding half hour?
I have a dataframe like below:
id time type day
___ _____ _____ ____
1 2016-10-12 01:45:01 1 3
1 2016-10-12 01:48:01 0 3
1 2016-10-12 01:50:01 1 3
1 2016-10-12 01:52:01 1 3
2 2016-10-12 01:53:01 1 3
2 2016-10-12 02:10:01 1 3
3 2016-10-12 01:45:01 1 3
3 2016-10-12 01:48:01 1 3
From this dataframe, for each row I want to count the occurrences of type 1 for that id during the preceding half hour.
For example, if we take the first row
1 2016-10-12 01:45:01 1 3
from it I want to count the occurrences of type 1 for that id from 2016-10-12 01:45:01
back to 2016-10-12 01:15:01,
which ends up as 0 because it is the first record.
id time type day count_of_type1
___ _____ _____ ____ ______________
1 2016-10-12 01:45:01 1 3 0
If we take the third row
1 2016-10-12 01:50:01 1 3
from it I want to count the occurrences of type 1 for that id from 2016-10-12 01:50:01
back to 2016-10-12 01:20:01,
which ends up as 2.
id time type day count_of_type1
___ _____ _____ ____ ______________
1 2016-10-12 01:50:01 1 3 2
I read the dataframe as below, and I know how to do the counting, but the part I'm unsure about is how to attach the count column to each row separately:
val df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").load("hdfs:///user/rkr/datafile.csv")
Any help is appreciated.
You can use a self-join to fetch the matching rows, with a join condition based on the timestamp.
val ds = Seq(( 1, "2016-10-12 01:45:01", 1, 3),
( 1, "2016-10-12 01:48:01", 0, 3),
( 1, "2016-10-12 01:50:01", 1, 3),
( 1, "2016-10-12 01:52:01", 1, 3),
( 2, "2016-10-12 01:53:01", 1, 3),
( 2, "2016-10-12 02:10:01", 1, 3),
( 3, "2016-10-12 01:45:01", 1, 3),
( 3, "2016-10-12 01:48:01", 1, 3)).
toDF("id", "time", "type", "day")
.withColumn("timestamp", unix_timestamp($"time", "yyyy-MM-dd HH:mm:ss"))
val happenBeforeHalfHour = ds.as("left").join(ds.as("right"), $"left.id" === $"right.id" && $"right.type" === 1 &&
$"left.timestamp" > $"right.timestamp" && $"left.timestamp" - $"right.timestamp" <= 1800)
.select($"left.id", $"left.time", $"left.type", $"left.day", $"left.timestamp")
happenBeforeHalfHour.groupBy("id", "time", "type", "day", "timestamp").count.show(false)
+---+-------------------+----+---+----------+-----+
|id |time               |type|day|timestamp |count|
+---+-------------------+----+---+----------+-----+
|1  |2016-10-12 01:48:01|0   |3  |1476211681|1    |
|2  |2016-10-12 02:10:01|1   |3  |1476213001|1    |
|1  |2016-10-12 01:52:01|1   |3  |1476211921|2    |
|1  |2016-10-12 01:50:01|1   |3  |1476211801|1    |
|3  |2016-10-12 01:48:01|1   |3  |1476211681|1    |
+---+-------------------+----+---+----------+-----+
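One caveat: the inner join drops rows that have no matching type-1 row in the preceding half hour, so the `count_of_type1 = 0` rows the question asks for (e.g. each id's first record) never appear in the output; switching the join type to `"left_outer"` and counting the non-null right-side rows would keep them. As a quick sanity check of the expected counts without a Spark session, the same windowed count can be sketched in plain Scala collections (the `Record` case class and `countOfType1` helper below are illustrative names, not Spark API):

```scala
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter
import java.time.temporal.ChronoUnit

// Illustrative stand-in for one dataframe row; not Spark API.
case class Record(id: Int, time: String, tpe: Int, day: Int)

val fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")

val rows = Seq(
  Record(1, "2016-10-12 01:45:01", 1, 3),
  Record(1, "2016-10-12 01:48:01", 0, 3),
  Record(1, "2016-10-12 01:50:01", 1, 3),
  Record(1, "2016-10-12 01:52:01", 1, 3),
  Record(2, "2016-10-12 01:53:01", 1, 3),
  Record(2, "2016-10-12 02:10:01", 1, 3),
  Record(3, "2016-10-12 01:45:01", 1, 3),
  Record(3, "2016-10-12 01:48:01", 1, 3))

// Count strictly earlier type-1 rows with the same id that fall
// within the preceding 30 minutes (1800 seconds) of this row.
def countOfType1(r: Record): Int = {
  val t = LocalDateTime.parse(r.time, fmt)
  rows.count { o =>
    val ot = LocalDateTime.parse(o.time, fmt)
    o.id == r.id && o.tpe == 1 &&
      ot.isBefore(t) && ChronoUnit.SECONDS.between(ot, t) <= 1800
  }
}

rows.foreach(r => println(s"${r.id} ${r.time} -> ${countOfType1(r)}"))
```

This mirrors the join condition above (strictly earlier, same id, within 1800 seconds) and additionally yields 0 for the first record of each id, which the inner-join version omits.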