How to export a Spark DataFrame whose columns hold value lists aggregated with collect_list() to a 3-dimensional Pandas structure in PySpark?
I have a DataFrame like this:
+---------+--------------------+-------------+--------+
| device  | windowtime         | values      | counts |
+---------+--------------------+-------------+--------+
| device_A| 2022-01-01 18:00:00| [99,100,102]| [1,3,1]|
| device_A| 2022-01-01 18:00:10| [98,100,101]| [1,2,2]|
+---------+--------------------+-------------+--------+
windowtime is treated as the X-axis values, values as the Y-axis values, and counts as the Z-axis values (to be plotted on a heatmap later).
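For context, a frame in this shape could come from a collect_list() aggregation on the Spark side; a minimal sketch, assuming a hypothetical raw frame raw_df with one row per (device, windowtime, value, count):

from pyspark.sql import functions as F

# Hypothetical raw long-format frame: one row per (device, windowtime, value, count).
# Grouping plus collect_list() produces the list-valued columns shown above.
# Note: collect_list() does not guarantee element order within a group.
df = (raw_df
      .groupBy("device", "windowtime")
      .agg(F.collect_list("value").alias("values"),
           F.collect_list("count").alias("counts")))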
How can I export this from a PySpark DataFrame to a 3-D Pandas object?
For the "2-D" case I have:
pdf = df.toPandas()
and then I can use it to build a Bokeh figure like this:
fig1ADB = figure(title="My 2 graph", tooltips=TOOLTIPS, x_axis_type='datetime')
fig1ADB.line(x='windowtime', y='values', source=source, color="orange")
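For the snippet above to run, source and TOOLTIPS have to exist first; a minimal sketch (the tooltip spec is a placeholder, not from the original post, and pdf is assumed to be a flat, non-aggregated frame):

from bokeh.plotting import figure, show
from bokeh.models import ColumnDataSource

TOOLTIPS = [("value", "@values")]  # hypothetical tooltip spec
source = ColumnDataSource(pdf)     # wrap the exported Pandas frame for Bokeh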
But I would like to use something like this:
# HeatMap came from the old bokeh.charts API, which has since been removed
hm = HeatMap(data, x='windowtime', y='values', values='counts', title='My heatmap (3d) graph', stat=None)
show(hm)
What kind of transformation should I apply to get there?
I have since realized that this approach itself was wrong: the lists should not be aggregated before exporting to Pandas!
Following the discussion below:
https://discourse.bokeh.org/t/cant-render-heatmap-data-for-apache-zeppelins-pyspark-dataframe/8844/8
Instead of columns holding grouped lists of values/counts, we take the raw table, with one row per unique ID ('values') and count value ('index'), each with its own 'window_time':
+-------------------+------+-----+
|window_time |values|index|
+-------------------+------+-----+
|2022-01-24 18:00:00|999 |2 |
|2022-01-24 19:00:00|999 |1 |
|2022-01-24 20:00:00|999 |3 |
|2022-01-24 21:00:00|999 |4 |
|2022-01-24 22:00:00|999 |5 |
|2022-01-24 18:00:00|998 |4 |
|2022-01-24 19:00:00|998 |5 |
|2022-01-24 20:00:00|998 |3 |
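Conversely, if you only have the list-aggregated frame from the top of the post, you can recover this long format on the Spark side before calling toPandas(); a minimal sketch, assuming Spark 2.4+ for arrays_zip:

from pyspark.sql import functions as F

# Zip the two parallel arrays, explode into one row per (value, count) pair,
# then flatten the resulting struct back into plain columns.
long_df = (df
           .withColumn("pair", F.explode(F.arrays_zip("values", "counts")))
           .select(F.col("windowtime").alias("window_time"),
                   F.col("pair.values").alias("values"),
                   F.col("pair.counts").alias("index")))
pdf = long_df.toPandas()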
rowIDs = pdf['values']        # row labels for the pivot (Y axis)
colIDs = pdf['window_time']   # column labels for the pivot (X axis)
# Pivot the long table into a 2-D matrix:
# rows = values, columns = window_time, cells = 'index' (0 where missing)
A = pdf.pivot_table('index', 'values', 'window_time', fill_value=0)
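With the sample rows above, the pivot yields a matrix roughly like this (column labels abbreviated; pivot_table sorts both axes):

values \ window_time  18:00  19:00  20:00  21:00  22:00
998                       4      5      3      0      0
999                       2      1      3      4      5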
import pandas as pd
from bokeh.plotting import figure, show
from bokeh.models import ColumnDataSource, LogColorMapper, ColorBar
from bokeh.layouts import gridplot

source = ColumnDataSource(data={
    'x':  [pd.to_datetime('Jan 24 2022')],  # leftmost edge of the image
    'y':  [0],                              # bottommost edge of the image
    'dw': [pdf['window_time'].max() - pdf['window_time'].min()],  # TOTAL width of image
    #'dh': [df['delayWindowEnd'].max()],    # TOTAL height of image
    'dh': [1000],                           # TOTAL height of image
    'im': [A.to_numpy()],                   # 2-D array via to_numpy() on the pivoted frame
})
# Log color scale over the counts
color_mapper = LogColorMapper(palette="Viridis256", low=1, high=20)
plot = figure(toolbar_location=None, x_axis_type='datetime')
# Render the whole pivoted matrix as a single image glyph
plot.image(x='x', y='y', source=source, image='im', dw='dw', dh='dh', color_mapper=color_mapper)
color_bar = ColorBar(color_mapper=color_mapper, label_standoff=12)
plot.add_layout(color_bar, 'right')
#show(plot)
show(gridplot([plot], ncols=1, plot_width=1000, plot_height=400))
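One caveat, not from the original post: plot_width/plot_height are Bokeh 2.x keywords and were removed in Bokeh 3.x, where the last line would instead read:

# Bokeh 3.x renamed plot_width/plot_height to width/height
show(gridplot([plot], ncols=1, width=1000, height=400))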
Result:
(rendered heatmap screenshot from the original post)