如何在 Bokeh 中为 Spark 中计算的定时 window 事件构建值密度热图？

Question

根据可以像这样汇总每个时间单位的值出现次数：

+---------+----------+------------------------+------------+------+
|device_id|read_date |ids                     |counts      |top_id|
+---------+----------+------------------------+------------+------+
|device_A |2017-08-05|[4041]                  |[3]         |4041  |
|device_A |2017-08-06|[4041, 4041]            |[3, 3]      |4041  |
|device_A |2017-08-07|[4041, 4041, 4041]      |[3, 3, 4]   |4041  |
|device_A |2017-08-08|[4041, 4041, 4041]      |[3, 4, 3]   |4041  |
|device_A |2017-08-09|[4041, 4041, 4041]      |[4, 3, 3]   |4041  |
|device_A |2017-08-10|[4041, 4041, 4041, 4045]|[3, 3, 1, 2]|4041  |
|device_A |2017-08-11|[4041, 4041, 4045, 4045]|[3, 1, 2, 3]|4045  |
|device_A |2017-08-12|[4041, 4045, 4045, 4045]|[1, 2, 3, 3]|4045  |
|device_A |2017-08-13|[4045, 4045, 4045]      |[3, 3, 3]   |4045  |
+---------+----------+------------------------+------------+------+

我想在 Zeppelin 中绘制 X 为 read_time，Y 为整数 ID 值，计数将其转换为热图。我如何使用 Bokeh 和 pandas?

绘制它

Answer 1

这种 DataFrame 基于更普通的 DataFrame，其中 id 和计数未分组到数组中。使用非分组 DataFrame 与 Bokeh 构建更方便：

https://discourse.bokeh.org/t/cant-render-heatmap-data-for-apache-zeppelins-pyspark-dataframe/8844/8

我们没有分组列出列 ids/counts，而是原始的 table 每个唯一 ID ('value') 和计数值 ('index') 一行行有它的 'write_time'

rowIDs = pdf['values']
colIDs = pdf['window_time']

A = pdf.pivot_table('index', 'values', 'window_time', fill_value=0)

source = ColumnDataSource(data={'x':[pd.to_datetime('Jan 24 2022')] #left most
                           ,'y':[0] #bottom most 
                           ,'dw':[pdf['window_time'].max()-pdf['window_time'].min()] #TOTAL width of image
                           #,'dh':[pdf['delayWindowEnd'].max()] #TOTAL height of image
                           ,'dh':[1000] #TOTAL height of image
                           ,'im':[A.to_numpy()] #2D array using to_numpy() method on pivotted df
                           })


color_mapper = LogColorMapper(palette="Viridis256", low=1, high=20)

plot = figure(toolbar_location=None,x_axis_type='datetime')
plot.image(x='x', y='y', source=source, image='im',dw='dw',dh='dh',  color_mapper=color_mapper)

color_bar = ColorBar(color_mapper=color_mapper, label_standoff=12)

plot.add_layout(color_bar, 'right')

#show(plot)
show(gridplot([plot], ncols=1, plot_width=1000, high=pdf['index'].max()))

结果：

如何在 Bokeh 中为 Spark 中计算的定时 window 事件构建值密度热图？

How to build a values density heatmap in Bokeh for timed window occurencies calculated in Spark?

python

bokeh

apache-spark

apache-zeppelin

pandas-bokeh