Binning pandas/numpy array in unequal sizes with approx equal computational cost

Question

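I have a problem where data must be processed across multiple cores. Let df be a Pandas DataFrameGroupBy (size()) object. Each value represents the computational "cost" that one GroupBy group imposes on a core. How can I divide df into n bins of unequal sizes but with (approximately) the same computational cost?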

import pandas as pd
import numpy as np
size = 50
rng = np.random.default_rng(2021)
df = pd.DataFrame({
    "one": np.linspace(0, 10, size, dtype=np.uint8),
    "two": np.linspace(0, 5, size, dtype=np.uint8),
    "data": rng.integers(0, 100, size)
})
groups = df.groupby(["one", "two"]).sum()

df
    one  two  data
0     0    0    75
1     0    0    75
2     0    0    49
3     0    0    94
4     0    0    66
...
45    9    4    12
46    9    4    97
47    9    4    12
48    9    4    32
49   10    5    45

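People typically split a dataset into n bins, as in the code below. However, splitting the dataset into n equal parts is undesirable because the cores receive very unbalanced workloads, e.g. 205 vs. 788.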

n = 4
bins = np.array_split(groups, n) # undesired

[b.sum() for b in bins]  #undesired
[data    788
dtype: int64, data    558
dtype: int64, data    768
dtype: int64, data    205
dtype: int64]

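A desirable solution splits the data into bins of unequal sizes whose summed values are approximately equal. That is, the difference abs(743-548) = 195 is smaller than with the previous method, abs(205-788) = 583. The difference should be as small as possible. A simple list example of what the result should look like: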

# only an example to demonstrate desired functionality
example = [[[10, 5], 45], [[2, 1], 187], [[3, 1], 249], [[6, 3], 262]], [[[9, 4], 153], [[4, 2], 248], [[1, 0], 264]], [[[8, 4], 245], [[7, 3], 326]], [[[5, 2], 189], [[0, 0], 359]]

[sum([size for (group, size) in test]) for test in example]  # [743, 665, 571, 548]

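Is there a more efficient way, in pandas or numpy, to split the dataset into bins as described above?

It is important to split/bin the GroupBy object so that the data can be accessed in a way similar to what np.array_split() returns.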

Answer 1

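I think I have found a good approach, thanks to a colleague.

The idea is to sort the group sizes in descending order and place the groups into bins in a "backward S" pattern. Let me illustrate with an example. Assume n = 3 (the number of bins) and the sorted group sizes listed in the table below.

Each group goes into one bin, moving across the bins from left to right and then back from right to left in a "backward S" pattern. The first element goes into bin 0, the second into bin 1, and so on. After reaching the last bin, go backwards: the fourth element goes into bin 2, the fifth into bin 1, and so on. The table below shows how the elements are placed into bins. The number in parentheses is the group number; the value next to it is the group size.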

 Bins:  |    0    |    1    |    2    |
        |  359 (0)|  326 (1)|  264 (2)|  
        |  248 (5)|  249 (4)|  262 (3)|
        |  245 (6)|  189 (7)|  187 (8)|
        |         |   45(10)|  153 (9)|
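
The placement order in the table can be checked with a few lines of plain Python. This is only a sketch: the sizes are copied from the table, and the use of itertools.cycle over a forward-then-backward list of bin indices is an illustrative choice, not part of the original answer.

```python
from itertools import cycle

# Group sizes copied from the table above, already in descending order.
sizes = [359, 326, 264, 262, 249, 248, 245, 189, 187, 153, 45]
n = 3  # number of bins

# One full "backward S" pass over the bins: forward, then backward.
order = list(range(n)) + list(reversed(range(n)))  # [0, 1, 2, 2, 1, 0]

bins = [[] for _ in range(n)]
for idx, size in zip(cycle(order), sorted(sizes, reverse=True)):
    bins[idx].append(size)

bin_sums = [sum(members) for members in bins]
print(bins)      # [[359, 248, 245], [326, 249, 189, 45], [264, 262, 187, 153]]
print(bin_sums)  # [852, 809, 866]
```

Repeating the index list [0, 1, 2, 2, 1, 0] is what makes the last bin of one pass also the first bin of the next pass.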

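The bins in this example hold roughly the same summed value and therefore carry roughly the same computational "cost". For anyone interested, the bin sums are [852, 809, 866]. I also tried this on a real-world dataset, and the bins came out similarly balanced. There is no guarantee that the bins will be similarly balanced for every dataset.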
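
To illustrate the missing guarantee: for the made-up costs [90, 30, 30, 30] and n = 2, the "backward S" pattern yields bin sums of [120, 60]. One common alternative, which is not part of this answer, is a greedy heuristic that always gives the next-largest group to the currently lightest bin. A minimal sketch, with greedy_split as a hypothetical helper name:

```python
import heapq

def greedy_split(costs, n):
    """Hypothetical helper: give each cost to the currently lightest bin."""
    heap = [(0, b) for b in range(n)]  # (running total, bin index)
    bins = [[] for _ in range(n)]
    for cost in sorted(costs, reverse=True):
        total, b = heapq.heappop(heap)  # lightest bin so far
        bins[b].append(cost)
        heapq.heappush(heap, (total + cost, b))
    return bins

skewed = [90, 30, 30, 30]  # made-up costs, for illustration only
greedy_sums = [sum(members) for members in greedy_split(skewed, 2)]
print(greedy_sums)  # [90, 90]
```

Neither heuristic is better in every case. On the eleven group sizes from the example above, this greedy sketch gives [794, 917, 816], which is less even than the [852, 809, 866] of the "backward S" pattern, so it may be worth trying both on the actual data.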

The code below could be made more efficient, but it is enough to get the idea across:

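import numpy as np
import pandas as pd

n = 3  # number of bins
size = 50
rng = np.random.default_rng(2021)
df = pd.DataFrame({
    "one": np.linspace(0, 10, size, dtype=np.uint8),
    "two": np.linspace(0, 5, size, dtype=np.uint8),
    "data": rng.integers(0, 100, size)
})

# Sum the cost per group, then sort so the most expensive groups come first.
# Note: reset_index(drop=True) discards the ("one", "two") group keys.
groups = df.groupby(["one", "two"]).sum()
groups = groups.sort_values("data", ascending=False).reset_index(drop=True)

# Walk over the bins 0..n-1, then n-1..0, and so on ("backward S" pattern).
bins = [[] for _ in range(n)]
backward = False
i = 0
for group in groups.iterrows():  # each item is an (index, row) tuple
    bins[i].append(group)
    i = i + 1 if not backward else i - 1
    if i == n:  # ran past the last bin: turn around
        backward = True
        i -= 1
    if i == -1 and backward:  # ran past the first bin: turn around
        backward = False
        i += 1

# Total cost per bin
[sum(row["data"] for (_, row) in b) for b in bins]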
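
The question asks for bins that can be accessed like the result of np.array_split(). A vectorized variant of the same "backward S" pattern might look like the sketch below. The assumptions: snake_split is a hypothetical helper name, the group sums are copied from the question's example list rather than regenerated from the random seed, and the ("one", "two") keys are kept as the index of each returned DataFrame.

```python
import numpy as np
import pandas as pd

# Group sums copied from the question's example, keyed by ("one", "two").
keys = [(0, 0), (1, 0), (2, 1), (3, 1), (4, 2), (5, 2),
        (6, 3), (7, 3), (8, 4), (9, 4), (10, 5)]
costs = [359, 264, 187, 249, 248, 189, 262, 326, 245, 153, 45]
groups = pd.DataFrame(
    {"data": costs},
    index=pd.MultiIndex.from_tuples(keys, names=["one", "two"]),
)

def snake_split(frame, n, column="data"):
    """Hypothetical helper: split frame into n DataFrames in snake order."""
    ordered = frame.sort_values(column, ascending=False, kind="mergesort")
    # Position within one forward-and-backward pass over the bins.
    pos = np.arange(len(ordered)) % (2 * n)
    # Forward half maps to 0..n-1, backward half maps to n-1..0.
    bin_ids = np.where(pos < n, pos, 2 * n - 1 - pos)
    return [ordered[bin_ids == b] for b in range(n)]

bins = snake_split(groups, n=3)
bin_sums = [int(b["data"].sum()) for b in bins]
print(bin_sums)  # [852, 809, 866]
```

Each element of bins is an ordinary DataFrame, so it can be handed to a worker process in the same way as a chunk from np.array_split(), and its index still identifies the original groups.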
