How to get the occurrence rate of specific values with Apache Spark
I have a raw data DataFrame like this:
+-----------+--------------------+------+
|device | timestamp | value|
+-----------+--------------------+------+
| device_A|2022-01-01 18:00:01 | 100|
| device_A|2022-01-01 18:00:02 | 99|
| device_A|2022-01-01 18:00:03 | 100|
| device_A|2022-01-01 18:00:04 | 102|
| device_A|2022-01-01 18:00:05 | 100|
| device_A|2022-01-01 18:00:06 | 99|
| device_A|2022-01-01 18:00:11 | 98|
| device_A|2022-01-01 18:00:12 | 100|
| device_A|2022-01-01 18:00:13 | 100|
| device_A|2022-01-01 18:00:15 | 101|
| device_A|2022-01-01 18:00:17 | 101|
I want to aggregate it and build 10-second aggregates with the listed values and their counts, like this:
+-----------+--------------------+------------+-------+
|device | windowtime | values| counts|
+-----------+--------------------+------------+-------+
| device_A|2022-01-01 18:00:00 |[99,100,102]|[1,3,1]|
| device_A|2022-01-01 18:00:10 |[98,100,101]|[1,2,2]|
so that I can later plot a heatmap of the values.
I have successfully obtained the values column (with the snippet below), but it is not clear to me how to compute the corresponding counts:
.withColumn("values", collect_list(col("value")).over(Window.partitionBy($"device").orderBy($"timestamp".desc)))
How can I do a weighted list aggregation like this in Apache Spark?
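For a runnable reference, here is a minimal sketch that rebuilds the sample DataFrame from the question. The local SparkSession setup and the string-to-timestamp parsing are assumptions, not part of the original post:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

// Assumed local session; adjust to your own environment.
val spark = SparkSession.builder().master("local[*]").appName("value-counts").getOrCreate()
import spark.implicits._

// Sample data copied from the question; timestamps are parsed from strings.
val df = Seq(
  ("device_A", "2022-01-01 18:00:01", 100),
  ("device_A", "2022-01-01 18:00:02", 99),
  ("device_A", "2022-01-01 18:00:03", 100),
  ("device_A", "2022-01-01 18:00:04", 102),
  ("device_A", "2022-01-01 18:00:05", 100),
  ("device_A", "2022-01-01 18:00:06", 99),
  ("device_A", "2022-01-01 18:00:11", 98),
  ("device_A", "2022-01-01 18:00:12", 100),
  ("device_A", "2022-01-01 18:00:13", 100),
  ("device_A", "2022-01-01 18:00:15", 101),
  ("device_A", "2022-01-01 18:00:17", 101)
).toDF("device", "timestamp", "value")
  .withColumn("timestamp", to_timestamp($"timestamp"))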
Group by the time window, using the window function with a duration of 10 seconds, to get the count per value and device; then group by device + window_time and collect a list of structs:
val result = (
  df
    // 1) Count the occurrences of each value per device and 10-second window
    .groupBy(
      $"device",
      window($"timestamp", "10 second")("start").as("window_time"),
      $"value"
    )
    .count()
    // 2) Collect the (value, count) pairs as structs per device and window
    .groupBy("device", "window_time")
    .agg(collect_list(struct($"value", $"count")).as("values"))
    // 3) Split the struct array into two parallel arrays: values and count
    .withColumn("count", col("values.count"))
    .withColumn("values", col("values.value"))
)
result.show()
//+--------+-------------------+--------------+---------+
//| device| window_time| values| count|
//+--------+-------------------+--------------+---------+
//|device_A|2022-01-01 18:00:00|[102, 99, 100]|[1, 2, 3]|
//|device_A|2022-01-01 18:00:10|[100, 101, 98]|[2, 2, 1]|
//+--------+-------------------+--------------+---------+
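Since the goal is to later plot a heatmap, one possible follow-up (a sketch, not part of the original answer) is to flatten the two parallel arrays back into long-format rows with arrays_zip and explode, which most plotting libraries handle more easily:

// Flatten to one row per (device, window_time, value, count).
// arrays_zip requires Spark 2.4+; the struct field names follow the zipped column names.
val longFormat = result
  .withColumn("pair", explode(arrays_zip($"values", $"count")))
  .select(
    $"device",
    $"window_time",
    $"pair.values".as("value"),
    $"pair.count".as("count")
  )
longFormat.show()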