Compare different histogram2d binnings using the same edges

Question

I have a dataset that looks like this:

tsne_results_x  tsne_results_y  team_id
0   -22.796648  -26.514051  107
1   11.985229   40.674446   107
2   -28.231720  -49.302216  107
3   31.942875   -14.427114  107
4   -46.436501  -7.750005   107
76  24.252718   -20.551889  8071
77  2.362172    17.170067   8071
78  7.212677    -9.056982   8071
79  -5.865472   -32.999077  8071

I want to bin the tsne_results_x and tsne_results_y columns, and for that I am using the numpy function histogram2d:

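import numpy as np
import matplotlib.pyplot as plt

# 2D histogram of the whole dataset: 15 x 15 bins
grid, xe, ye = np.histogram2d(df['tsne_results_x'], df['tsne_results_y'], bins=15)
# 15 bins need 16 edges per axis
gridx = np.linspace(min(df['tsne_results_x']), max(df['tsne_results_x']), 16)
gridy = np.linspace(min(df['tsne_results_y']), max(df['tsne_results_y']), 16)

plt.figure()
plt.grid(True)
# histogram2d puts x along the first axis; pcolormesh expects x along the second
plt.pcolormesh(gridx, gridy, grid.T)
plt.colorbar()

plt.show()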

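However, as you can see, I have several team_id values in the dataframe, and I want to compare the bins of one team against the whole dataframe. For example, for one team and one specific bin, I want to divide the team's count by the total count over all teams.

So I thought that running histogram2d on the dataset of one specific team, and using the same linspace as for the whole dataset, would do it. It does not, because histogram2d bins one_team_df differently: the team's data has a different range.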

one_team_df = df.loc[(df['team_id'] == str(299))]

# bins=15 spans only this team's data range, so the edges differ from the overall ones
grid_team, a, b = np.histogram2d(one_team_df['tsne_results_x'], one_team_df['tsne_results_y'], bins=15)

# grid of the whole dataset (16 edges for 15 bins)
gridx = np.linspace(min(df['tsne_results_x']), max(df['tsne_results_x']), 16)
gridy = np.linspace(min(df['tsne_results_y']), max(df['tsne_results_y']), 16)

plt.figure()
plt.grid(True)
plt.pcolormesh(gridx, gridy, grid_team.T)
plt.colorbar()

plt.show()

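I would like to know how to make these two representations comparable. Is it possible to run histogram2d with given xedges and yedges? That way I could bin one team using the edges of the overall binning.

Regards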

Answer 1

From the documentation of np.histogram2d:

bins : int or array_like or [int, int] or [array, array], optional
The bin specification:

If int, the number of bins for the two dimensions (nx=ny=bins).

If array_like, the bin edges for the two dimensions (x_edges=y_edges=bins).

If [int, int], the number of bins in each dimension (nx, ny = bins).

If [array, array], the bin edges in each dimension (x_edges, y_edges = bins).

A combination [int, array] or [array, int], where int is the number of bins and array is the bin edges.
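
As a rough illustration of the options quoted above, the sketch below calls np.histogram2d with three of the bins forms and prints the shape of each result. The arrays x and y are synthetic sample data invented for this example; they are not the data from the question.

```python
import numpy as np

# Hypothetical sample data, only for illustration
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = rng.normal(size=200)

# int: the same number of bins in both dimensions
h_int, xe_int, ye_int = np.histogram2d(x, y, bins=15)

# [int, int]: number of bins per dimension
h_pair, _, _ = np.histogram2d(x, y, bins=[10, 20])

# [array, array]: explicit bin edges per dimension
x_edges = np.linspace(-4, 4, 9)  # 9 edges -> 8 bins
y_edges = np.linspace(-4, 4, 5)  # 5 edges -> 4 bins
h_edges, xe, ye = np.histogram2d(x, y, bins=[x_edges, y_edges])

print(h_int.shape, h_pair.shape, h_edges.shape)
```

Note that n edges always produce n - 1 bins, and that the edges returned by the call are the ones that were passed in when arrays are given.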

This means you can specify the bins however you need. For example:

grid_team, a, b = np.histogram2d(
    one_team_df['tsne_results_x'], one_team_df['tsne_results_y'], 
    bins=[np.linspace(-40,40,15), np.linspace(-40,40,15)]
)
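
Instead of hard-coded limits, the edges returned by the histogram of the whole dataset can be passed back in, which is what the question asks for. The following is a minimal sketch under assumptions: the dataframe is a randomly generated stand-in using the question's column names, team_id is assumed to be an integer column, and the names grid_all and share are introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the question's dataframe
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "tsne_results_x": rng.uniform(-50, 50, size=300),
    "tsne_results_y": rng.uniform(-50, 50, size=300),
    "team_id": rng.choice([107, 299, 8071], size=300),
})

# 1) Bin the whole dataset and keep the edges that come back
grid_all, xe, ye = np.histogram2d(
    df["tsne_results_x"], df["tsne_results_y"], bins=15
)

# 2) Bin a single team with exactly those edges
one_team_df = df.loc[df["team_id"] == 299]
grid_team, xe_team, ye_team = np.histogram2d(
    one_team_df["tsne_results_x"], one_team_df["tsne_results_y"],
    bins=[xe, ye],
)

# 3) Per-bin share of the team; bins that are empty overall stay at 0
share = np.divide(
    grid_team, grid_all,
    out=np.zeros_like(grid_team), where=grid_all > 0,
)
```

Both grids now have the same shape and the same bin boundaries, so they can be divided element by element. To plot the result, something like plt.pcolormesh(xe, ye, share.T) should work, since pcolormesh expects the x dimension along the second axis of the array.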
