scipy.stats.binned_statistic_dd() bin numbering has lots of extra bins

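I am struggling with the output of scipy.stats.binned_statistic_dd(). I have an array of positions and another array of ids, which I am binning in 3 directions. I provide a list of bin edges as input rather than a number of bins in each direction plus a range option. I have 3 bins in x, 2 bins in y, and 3 bins in z, or 18 bins in total.

However, when I check the binnumbers that are returned, they are all greater than 20 (26 to 72 in the example below). How do I get the bin numbers to reflect the number of bins provided and get rid of all the extra bins?

I have tried to follow the suggestions in a post dealing with a similar problem (Output in scipy.stats.binned_statistic_dd()), but I don't understand how to apply them to my case. As usual, the documentation is as cryptic as ever.

Any help getting binnumbers between 1 and 18 in this example would be much appreciated!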

import numpy as np
from scipy import stats

pos = np.array([[-0.02042167, -0.0223282 ,  0.00123734],
       [-0.0420364 ,  0.01196078,  0.00694259],
       [-0.09625651, -0.00311446,  0.06125461],
       [-0.07693234, -0.02749618,  0.03617278],
       [-0.07578646,  0.01199925,  0.02991888],
       [-0.03258293, -0.00371765,  0.04245596],
       [-0.06765955,  0.02798434,  0.07075846],
       [-0.02431445,  0.02774102,  0.06719837],
       [ 0.02798265, -0.01096739, -0.01658691],
       [-0.00584252,  0.02043389, -0.00827088],
       [ 0.00623063, -0.02642285,  0.03232817],
       [ 0.00884222,  0.01498996,  0.02912483],
       [ 0.07189474, -0.01541584,  0.01916607],
       [ 0.07239394,  0.0059483 ,  0.0740187 ],
       [-0.08519159, -0.02894125,  0.10923724],
       [-0.10803509,  0.01365444,  0.09555333],
       [-0.0442866 , -0.00845725,  0.10361843],
       [-0.04246779,  0.00396127,  0.1418258 ],
       [-0.08975861,  0.02999023,  0.12713186],
       [ 0.01772454, -0.0020405 ,  0.08824418]])

ids = np.array([16,  9,  6, 19,  1,  4, 10,  5, 18, 11,  2, 12, 13,  8,  3, 17, 14,
       15, 20,  7])

xbinEdges = np.array([-0.15298488, -0.05108961,  0.05080566,  0.15270093])
ybinEdges = np.array([-0.051,  0.   ,  0.051])
zbinEdges = np.array([-0.053,  0.049,  0.151,  0.253])

ret = stats.binned_statistic_dd(pos, ids, bins=[xbinEdges, ybinEdges, zbinEdges],
                                statistic='count', expand_binnumbers=False)
bincounts = ret.statistic
binnumber = ret.binnumber.T

>>> binnumber  = array([46, 51, 27, 26, 31, 46, 32, 52, 46, 51, 46, 51, 66, 72, 27, 32, 47,
       52, 32, 47], dtype=int64)

ranges = [[-0.15298488071, 0.15270092971],
 [-0.051000000000000004, 0.051000000000000004],
 [-0.0530000000000001, 0.25300000000000006]]

ret3 = stats.binned_statistic_dd(pos, ids, bins=(3,2,3), statistic='count', expand_binnumbers=False, range=ranges)
bincounts = ret3.statistic
binnumber = ret3.binnumber.T

>>> binnumber  = array([46, 51, 27, 26, 31, 46, 32, 52, 46, 51, 46, 51, 66, 72, 27, 32, 47,
       52, 32, 47], dtype=int64)

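OK, after a few days of thinking it over in the background and a quick skim of the binned_statistic_dd() source code, I think I have found the right answer, and it is quite simple.

It seems that binned_statistic_dd() adds an extra set of outlier bins during the binning stage and then removes them when returning the histogram results, but leaves the bin numbers untouched (I assume this is in case the result is reused for further statistic outputs). With one outlier bin below and one above the edges in each dimension, the numbering here runs over 5 x 4 x 5 = 100 cells rather than 3 x 2 x 3 = 18, which is why values such as 46 and 72 appear.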
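To make the padded numbering concrete, here is a minimal sketch that decodes the flattened binnumbers from the question by hand. It assumes the flattened index is row-major over a grid with two extra bins per dimension, which is consistent with the values above (the first point lands in x bin 2, y bin 1, z bin 1, and 2*20 + 1*5 + 1 = 46). The names flat, padded, per_dim and compact are only illustrative.

```python
import numpy as np

# Flattened bin numbers reported in the question (expand_binnumbers=False)
flat = np.array([46, 51, 27, 26, 31, 46, 32, 52, 46, 51,
                 46, 51, 66, 72, 27, 32, 47, 52, 32, 47])

nbins = (3, 2, 3)                       # real bins in x, y, z
padded = tuple(n + 2 for n in nbins)    # assumed padded grid: (5, 4, 5)

# Recover the per-dimension indices (1-based; 0 and n+1 would be outlier bins)
per_dim = np.stack(np.unravel_index(flat, padded))

# Shift to 0-based and re-flatten over the real 3 x 2 x 3 grid
compact = np.ravel_multi_index(tuple(per_dim - 1), nbins)

print(per_dim[:, 0])   # per-dimension indices of the first point
print(compact)         # ids in 0..17; add 1 for ids in 1..18
```

This only works as written because none of the sample points falls outside the supplied edges; an outlier would give an index of -1 or n after the shift and np.ravel_multi_index would raise.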

So if you export the expanded binnumbers (expand_binnumbers=True) and then subtract 1 from each binnumber to re-adjust the bin indices, it seems you can compute the "correct" bin ids.

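# Number of real bins per dimension (one fewer than the number of edges)
numX, numY, numZ = len(xbinEdges) - 1, len(ybinEdges) - 1, len(zbinEdges) - 1

ret2 = stats.binned_statistic_dd(pos, ids, bins=[xbinEdges, ybinEdges, zbinEdges],
                                statistic='count', expand_binnumbers=True)
bincounts2 = ret2.statistic
binnumber2 = ret2.binnumber   # shape (3, N): one 1-based index per dimension
indxnum2 = binnumber2 - 1     # shift to 0-based indices into the real bins
# Flatten to a single id per point in 0..17; add 1 if you want ids in 1..18
corrected_bin_ids = np.ravel_multi_index(tuple(indxnum2), (numX, numY, numZ))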

Quick and simple in the end!
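One caveat with the subtract-1 approach is that a point lying outside the supplied edges ends up with an index of -1 or n in some dimension, which np.ravel_multi_index rejects. A rough sketch of one way to guard against that is below; compact_bin_ids is a hypothetical helper (not part of SciPy), and marking outliers with -1 is just one possible convention.

```python
import numpy as np

def compact_bin_ids(expanded, nbins):
    """Hypothetical helper: map expanded (ndim, N) 1-based bin numbers
    to 0-based flat ids over the real bins, using -1 for outliers."""
    expanded = np.asarray(expanded)
    nbins = np.asarray(nbins)
    zero_based = expanded - 1
    # A point is inside only if every dimension falls in a real bin
    inside = np.all((zero_based >= 0) & (zero_based < nbins[:, None]), axis=0)
    ids = np.full(expanded.shape[1], -1, dtype=np.intp)
    ids[inside] = np.ravel_multi_index(tuple(zero_based[:, inside]), tuple(nbins))
    return ids

# Made-up expanded binnumbers for three points; the third is an outlier
# (x index 0 is below the first edge, z index 4 is above the last edge)
expanded = np.array([[2, 1, 0],
                     [1, 2, 1],
                     [1, 2, 4]])
ids = compact_bin_ids(expanded, (3, 2, 3))
print(ids)
```

In the question's data every point is inside the edges, so this would give the same ids as corrected_bin_ids above.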