How to export a Spark DataFrame whose columns hold value lists aggregated with collect_list() to a 3-dimensional Pandas structure in PySpark?
I have a DataFrame like this:
+---------+--------------------+-------------+--------+
| device  | windowtime         | values      | counts |
+---------+--------------------+-------------+--------+
| device_A| 2022-01-01 18:00:00| [99,100,102]| [1,3,1]|
| device_A| 2022-01-01 18:00:10| [98,100,101]| [1,2,2]|
+---------+--------------------+-------------+--------+
windowtime is treated as the X-axis values, values as the Y-axis values, and counts as the Z-axis values (to be plotted on a heatmap later).
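For context, a frame in this shape could come from a collect_list() aggregation on the Spark side; a minimal sketch, assuming a hypothetical raw frame raw_df with one row per (device, windowtime, value, count):

from pyspark.sql import functions as F

# Hypothetical raw long-format frame: one row per (device, windowtime, value, count).
# Grouping plus collect_list() produces the list-valued columns shown above.
# Note: collect_list() does not guarantee element order within a group.
df = (raw_df
      .groupBy("device", "windowtime")
      .agg(F.collect_list("value").alias("values"),
           F.collect_list("count").alias("counts")))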
How can I export this from a PySpark DataFrame to a 3-D Pandas object?
For the "2-D" case I have:
pdf = df.toPandas()
and then I can use it to build a Bokeh figure like this:
fig1ADB = figure(title="My 2 graph", tooltips=TOOLTIPS, x_axis_type='datetime')
fig1ADB.line(x='windowtime', y='values', source=source, color="orange")
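For the snippet above to run, source and TOOLTIPS have to exist first; a minimal sketch (the tooltip spec is a placeholder, not from the original post, and pdf is assumed to be a flat, non-aggregated frame):

from bokeh.plotting import figure, show
from bokeh.models import ColumnDataSource

TOOLTIPS = [("value", "@values")]  # hypothetical tooltip spec
source = ColumnDataSource(pdf)     # wrap the exported Pandas frame for Bokeh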
But I would like to use something like this:
# HeatMap came from the old bokeh.charts API, which has since been removed
hm = HeatMap(data, x='windowtime', y='values', values='counts', title='My heatmap (3d) graph', stat=None)
show(hm)
What kind of transformation should I apply to get there?
I have since realized that this approach itself was wrong: the lists should not be aggregated before exporting to Pandas!
Following the discussion below:
https://discourse.bokeh.org/t/cant-render-heatmap-data-for-apache-zeppelins-pyspark-dataframe/8844/8
Instead of columns holding grouped lists of values/counts, we take the raw table, with one row per unique ID ('values') and count value ('index'), each with its own 'window_time':
+-------------------+------+-----+
|window_time |values|index|
+-------------------+------+-----+
|2022-01-24 18:00:00|999 |2 |
|2022-01-24 19:00:00|999 |1 |
|2022-01-24 20:00:00|999 |3 |
|2022-01-24 21:00:00|999 |4 |
|2022-01-24 22:00:00|999 |5 |
|2022-01-24 18:00:00|998 |4 |
|2022-01-24 19:00:00|998 |5 |
|2022-01-24 20:00:00|998 |3 |
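Conversely, if you only have the list-aggregated frame from the top of the post, you can recover this long format on the Spark side before calling toPandas(); a minimal sketch, assuming Spark 2.4+ for arrays_zip:

from pyspark.sql import functions as F

# Zip the two parallel arrays, explode into one row per (value, count) pair,
# then flatten the resulting struct back into plain columns.
long_df = (df
           .withColumn("pair", F.explode(F.arrays_zip("values", "counts")))
           .select(F.col("windowtime").alias("window_time"),
                   F.col("pair.values").alias("values"),
                   F.col("pair.counts").alias("index")))
pdf = long_df.toPandas()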
rowIDs = pdf['values']        # row labels for the pivot (Y axis)
colIDs = pdf['window_time']   # column labels for the pivot (X axis)
# Pivot the long table into a 2-D matrix:
# rows = values, columns = window_time, cells = 'index' (0 where missing)
A = pdf.pivot_table('index', 'values', 'window_time', fill_value=0)
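With the sample rows above, the pivot yields a matrix roughly like this (column labels abbreviated; pivot_table sorts both axes):

values \ window_time  18:00  19:00  20:00  21:00  22:00
998                       4      5      3      0      0
999                       2      1      3      4      5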
import pandas as pd
from bokeh.plotting import figure, show
from bokeh.models import ColumnDataSource, LogColorMapper, ColorBar
from bokeh.layouts import gridplot

source = ColumnDataSource(data={
    'x':  [pd.to_datetime('Jan 24 2022')],  # leftmost edge of the image
    'y':  [0],                              # bottommost edge of the image
    'dw': [pdf['window_time'].max() - pdf['window_time'].min()],  # TOTAL width of image
    #'dh': [df['delayWindowEnd'].max()],    # TOTAL height of image
    'dh': [1000],                           # TOTAL height of image
    'im': [A.to_numpy()],                   # 2-D array via to_numpy() on the pivoted frame
})
# Log color scale over the counts
color_mapper = LogColorMapper(palette="Viridis256", low=1, high=20)
plot = figure(toolbar_location=None, x_axis_type='datetime')
# Render the whole pivoted matrix as a single image glyph
plot.image(x='x', y='y', source=source, image='im', dw='dw', dh='dh', color_mapper=color_mapper)
color_bar = ColorBar(color_mapper=color_mapper, label_standoff=12)
plot.add_layout(color_bar, 'right')
#show(plot)
show(gridplot([plot], ncols=1, plot_width=1000, plot_height=400))
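One caveat, not from the original post: plot_width/plot_height are Bokeh 2.x keywords and were removed in Bokeh 3.x, where the last line would instead read:

# Bokeh 3.x renamed plot_width/plot_height to width/height
show(gridplot([plot], ncols=1, width=1000, height=400))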
Result:
(rendered heatmap screenshot from the original post)