Return data indices for all bins with counts greater than threshold

Question

I am trying to find the indices of all data points that fall within a given bin of binned data, like this:

import numpy as np

x=np.random.random(1000)
y=np.random.random(1000)
# The bins are not evenly spaced and not the same number in x and y.
xedges = np.array([0.1, 0.2, 0.4, 0.5, 0.55, 0.6, 0.8, 0.9])
yedges = np.array([0.1, 0.2, 0.4, 0.5, 0.55, 0.6, 0.8, 0.9])

h=np.histogram2d(x,y, bins=[xedges,yedges])

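I want to find the indices of the points contained in each bin whose count is greater than some threshold (and then plot them, etc.). So every bin with a count above the threshold is a "cluster", and I want to know all the data points (x, y) in that cluster.

I wrote in pseudocode how I think it would work.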

thres = 5
mask = h[0] > thres  # h[0] holds the counts returned by np.histogram2d

for i in mask:
    # for each bin with count > thres
    # get bin edges for x and y directions

    # find (leftEdge < x < rightEdge) and (leftEdge < y < rightEdge)

    # return indices for each True in mask

plt.plot(x[indices], y[indices])

I tried reading the documentation for functions such as scipy.stats.binned_statistic_2d and pandas.DataFrame.groupby, but I can't figure out how to apply them to my data. For binned_statistic_2d they ask for an argument values:

The data on which the statistic will be computed. This must be the same shape as x, or a set of sequences - each the same shape as x.

and I am not sure how to pass in the data I want it to be computed on.

Thank you for any help you can provide on this issue.

Answer 1

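If I understand correctly, you want to build a mask over the original points that indicates whether each point belongs to a bin containing more than 5 points.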

np.histogram2d returns the count for each bin, but it doesn't indicate which point went into which bin, so the mask has to be built separately.

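You can construct such a mask by iterating over each bin that fulfills the condition and adding the indices of all points inside that bin to the mask.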
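As a loop-free variant of the same idea, the bin of each point can be looked up directly. The following is a minimal sketch, assuming np.digitize is an acceptable way to assign points to bins; the names ix, iy, inside and indices are illustrative only. Note that np.digitize treats every bin as half-open, so a point lying exactly on the last edge is treated as outside, whereas np.histogram2d would count it in the last bin.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 2, 500)
y = rng.uniform(0, 1, 500)

xedges = np.array([0.1, 0.2, 0.5, 0.55, 0.6, 0.8, 1.0, 1.3, 1.5, 1.9])
yedges = np.array([0.1, 0.2, 0.4, 0.5, 0.55, 0.6, 0.8, 0.9])
thres = 5

# counts has shape (number of x bins, number of y bins)
counts, _, _ = np.histogram2d(x, y, bins=[xedges, yedges])

# Zero-based bin index of every point along each axis.
# -1 means left of the first edge, n_bins means right of the last edge.
ix = np.digitize(x, xedges) - 1
iy = np.digitize(y, yedges) - 1
inside = (ix >= 0) & (ix < len(xedges) - 1) & (iy >= 0) & (iy < len(yedges) - 1)

# A point is kept if it lies inside the grid and its bin count exceeds thres.
mask = np.zeros(x.shape, dtype=bool)
mask[inside] = counts[ix[inside], iy[inside]] > thres

# Integer indices into x and y of the selected points.
indices = np.flatnonzero(mask)
```

x[indices] and y[indices] are then the points lying in bins whose count exceeds thres.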

To visualize the result of np.histogram2d, plt.pcolormesh can be used. Drawing the mesh with h > 5 shows all True values in the highest color (red) and all False values in the lowest color (blue).

from matplotlib import pyplot as plt
import numpy as np

x = np.random.uniform(0, 2, 500)
y = np.random.uniform(0, 1, x.shape)

xedges = np.array([0.1, 0.2, 0.5, 0.55, 0.6, 0.8, 1.0, 1.3, 1.5, 1.9])
yedges = np.array([0.1, 0.2, 0.4, 0.5, 0.55, 0.6, 0.8, 0.9])

hist, _xedges, _yedges = np.histogram2d(x, y, bins=[xedges, yedges])

h = hist.T  # np.histogram2d transposes x and y, therefore, transpose the resulting array
thres = 5
desired = h > thres
plt.pcolormesh(xedges, yedges, desired, cmap='coolwarm', ec='white', lw=2)

mask = np.zeros_like(x, dtype=bool)  # start with mask all False (np.bool was removed in NumPy 1.24)
for i in range(len(xedges) - 1):
    for j in range(len(yedges) - 1):
        if desired[j, i]:
            # print(f'x from {xedges[i]} to {xedges[i + 1]} y from {yedges[j]} to {yedges[j + 1]}')
            mask = np.logical_or(mask, (x >= xedges[i]) & (x < xedges[i + 1]) & (y >= yedges[j]) & (y < yedges[j + 1]))
            # plt.scatter(np.random.uniform(xedges[i], xedges[i+1], 100), np.random.uniform(yedges[j], yedges[j+1], 100),
            #             marker='o', color='g', alpha=0.3)
plt.scatter(x, y, marker='o', color='gold', label='initial points')
plt.scatter(x[mask], y[mask], marker='.', color='green', label='filtered points')
plt.legend(bbox_to_anchor=(1.02, 1), loc='upper left')
plt.tight_layout()
plt.show()

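Note that in the given example, the edges don't cover the complete range of the points. Points outside the given edges are not taken into account. To include those points, just extend the edges.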
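Regarding the values argument mentioned in the question: for statistic='count' the SciPy documentation states that values is not referenced, so None can be passed. The sketch below assumes SciPy is installed and uses expand_binnumbers=True to get a per-axis bin number for every point; the names bx, by, inside and indices are illustrative only.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.uniform(0, 2, 500)
y = rng.uniform(0, 1, 500)

xedges = np.array([0.1, 0.2, 0.5, 0.55, 0.6, 0.8, 1.0, 1.3, 1.5, 1.9])
yedges = np.array([0.1, 0.2, 0.4, 0.5, 0.55, 0.6, 0.8, 0.9])
thres = 5

# values is None because statistic='count' does not use it.
result = stats.binned_statistic_2d(
    x, y, None, statistic='count',
    bins=[xedges, yedges], expand_binnumbers=True,
)
counts = result.statistic   # shape (number of x bins, number of y bins)
bx, by = result.binnumber   # one-based bin numbers; 0 or n_bins + 1 means outside the edges

nx, ny = counts.shape
inside = (bx >= 1) & (bx <= nx) & (by >= 1) & (by <= ny)

# Keep points that are inside the grid and whose bin count exceeds thres.
mask = np.zeros(x.shape, dtype=bool)
mask[inside] = counts[bx[inside] - 1, by[inside] - 1] > thres
indices = np.flatnonzero(mask)
```

As with the loop above, x[mask] and y[mask] can then be passed to plt.scatter.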

python

numpy

histogram

scipy

binning