How to speed up the agg of pandas groupby bins?

Question

I created different bins for each column and grouped the DataFrame by them.

import pandas as pd
import numpy as np

np.random.seed(100)
df = pd.DataFrame(np.random.randn(100, 4), columns=['a', 'b', 'c', 'value'])

# for simplicity, I use the same bin here
bins = np.arange(-3, 4, 0.05)

df['a_bins'] = pd.cut(df['a'], bins=bins)
df['b_bins'] = pd.cut(df['b'], bins=bins)
df['c_bins'] = pd.cut(df['c'], bins=bins)

The output of df.groupby(['a_bins','b_bins','c_bins']).size() shows that there are 2685619 groups.
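
The group count follows directly from the bin edges: with categorical keys, this groupby produces one group per combination of categories, whether or not any row falls into it. A quick check that reuses the `bins` array from the question (the names `n_bins` and `n_groups` are only illustrative):

```python
import numpy as np

bins = np.arange(-3, 4, 0.05)   # same edges as in the question

n_bins = len(bins) - 1          # pd.cut creates one category per adjacent pair of edges
n_groups = n_bins ** 3          # one group per (a_bin, b_bin, c_bin) combination

print(n_bins, n_groups)
```

With only 100 rows in the frame, almost all of these groups are empty, which is where the time goes.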

Calculating the statistics for each group

Then, the statistics of each group are calculated like this:

%%timeit
df.groupby(['a_bins','b_bins','c_bins']).agg({'value':['mean']})

>>> 16.9 s ± 637 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Expected output

Is it possible to speed this up?
A faster method should also support looking up the value by passing in values of a, b, and c, like this:

df.groupby(['a_bins','b_bins','c_bins']).agg({'value':['mean']}).loc[(-1.72, 0.32, 1.18)]

>>> -0.252436

Answer 1

Since the bins are the same for all 3 columns, use the codes from the cat accessor:

%timeit df.groupby([df['a_bins'].cat.codes, df['b_bins'].cat.codes, df['c_bins'].cat.codes])['value'].mean()
1.82 ms ± 27.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
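
The answer does not show how to do the lookup by a, b, and c values once the groups are keyed by integer codes. A minimal sketch of one way to do it, assuming the default right-closed bins of `pd.cut`; the helper `bin_code` is hypothetical and not part of pandas:

```python
import numpy as np
import pandas as pd

np.random.seed(100)
df = pd.DataFrame(np.random.randn(100, 4), columns=["a", "b", "c", "value"])
bins = np.arange(-3, 4, 0.05)

# Integer category codes per column, as in the answer above.
codes = {col: pd.cut(df[col], bins=bins).cat.codes for col in "abc"}
means_by_code = df.groupby([codes["a"], codes["b"], codes["c"]])["value"].mean()


def bin_code(x):
    """Code of the right-closed bin (left, right] that contains x (hypothetical helper)."""
    return int(np.searchsorted(bins, x, side="left")) - 1


# Look up the bin that contains the first row of the frame.
a0, b0, c0 = df.loc[0, ["a", "b", "c"]]
looked_up = means_by_code.loc[(bin_code(a0), bin_code(b0), bin_code(c0))]
print(looked_up)
```

Looking up a combination of bins that contains no rows raises a KeyError here, because only occupied code combinations appear in the result.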

Answer 2

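This is a good use case for scipy.stats.binned_statistic_dd. The snippet below only calculates the mean statistic, but many other statistics are supported (see the scipy documentation for binned_statistic_dd):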

import numpy as np
import pandas as pd

np.random.seed(100)
df = pd.DataFrame(np.random.randn(100, 4), columns=["a", "b", "c", "value"])

# for simplicity, I use the same bin here
bins = np.arange(-3, 4, 0.05)

df["a_bins"] = pd.cut(df["a"], bins=bins)
df["b_bins"] = pd.cut(df["b"], bins=bins)
df["c_bins"] = pd.cut(df["c"], bins=bins)

# this takes about 35 seconds
result_pandas = df.groupby(["a_bins", "b_bins", "c_bins"]).agg({"value": ["mean"]})

from scipy.stats import binned_statistic_dd

# this takes about 20 ms
result_scipy = binned_statistic_dd(
    df[["a", "b", "c"]].to_numpy(), df["value"], bins=(bins, bins, bins)
)

# this is a verbose way to get a dataframe representation
# for many purposes this probably will not be needed
# takes about 5 seconds
temp_list = []
for na, a in enumerate(result_scipy[1][0][:-1]):
    for nb, b in enumerate(result_scipy[1][1][:-1]):
        for nc, c in enumerate(result_scipy[1][2][:-1]):
            value = result_scipy[0][na, nb, nc]
            temp_list.append([a, b, c, value])

result_scipy_as_df = pd.DataFrame(temp_list, columns=list("abcx"))

# check that the result is the same
result_scipy_as_df["x"].describe() == result_pandas["value"]["mean"].describe()

There may be room to speed this up further if you are interested.
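
One possible way to avoid the triple Python loop when a pandas representation is needed is to build the index in one call and flatten the statistic array. This is a sketch, not the answer's own method; it relies on `statistic.ravel()` (C order) and `pd.MultiIndex.from_product` both varying the last dimension fastest:

```python
import numpy as np
import pandas as pd
from scipy.stats import binned_statistic_dd

np.random.seed(100)
df = pd.DataFrame(np.random.randn(100, 4), columns=["a", "b", "c", "value"])
bins = np.arange(-3, 4, 0.05)

statistic, bin_edges, _ = binned_statistic_dd(
    df[["a", "b", "c"]].to_numpy(),
    df["value"].to_numpy(),
    statistic="mean",
    bins=(bins, bins, bins),
)

# Label each cell by the lower edge of its bin in every dimension.
lower_edges = [edges[:-1] for edges in bin_edges]
index = pd.MultiIndex.from_product(lower_edges, names=["a", "b", "c"])

# Empty bins hold NaN for the mean statistic; dropna keeps the occupied ones.
result_scipy_as_series = pd.Series(statistic.ravel(), index=index, name="x")
occupied = result_scipy_as_series.dropna()
print(len(result_scipy_as_series), len(occupied))
```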

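An important caveat is that binned_statistic_dd uses half-open bins that include the left edge and exclude the right edge, e.g. [0, 1), except for the last bin (see the Notes section of the scipy documentation). For consistent bin identifiers, right=False must therefore be used in pd.cut.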
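
The difference only shows up for values that sit exactly on a bin edge. A small illustration with made-up edges (`edges` and `x` below are not from the question):

```python
import numpy as np
import pandas as pd
from scipy.stats import binned_statistic_dd

edges = np.array([0.0, 1.0, 2.0, 3.0])
x = np.array([1.0])  # a value exactly on an inner bin edge

# Default pd.cut: right-closed intervals (0, 1], (1, 2], (2, 3]
code_right_closed = pd.cut(x, bins=edges).codes[0]

# right=False: left-closed intervals [0, 1), [1, 2), [2, 3)
code_left_closed = pd.cut(x, bins=edges, right=False).codes[0]

# scipy: count how many samples land in each bin and see which bin got the value
counts = binned_statistic_dd(
    x.reshape(-1, 1), x, statistic="count", bins=[edges]
).statistic
scipy_bin = int(np.argmax(counts))

print(code_right_closed, code_left_closed, scipy_bin)
```

Only the `right=False` code agrees with the bin that scipy assigns.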

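Here is a lookup example. Note that the bin edges are offset by one position here (the first edge is dropped), so that np.digitize returns a zero-based index into result_scipy.statistic and the result is comparable to the pandas lookup: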

aloc, bloc, cloc = -2.12, 0.23, -1.25
print(result_pandas.loc[(aloc, bloc, cloc)])
print(result_scipy.statistic[
    np.digitize(aloc, result_scipy.bin_edges[0][1:]),
    np.digitize(bloc, result_scipy.bin_edges[1][1:]),
    np.digitize(cloc, result_scipy.bin_edges[2][1:]),
])

Answer 3

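For this data, I'd suggest you pivot the data and take the mean. Usually this is faster, since you are hitting the entire dataframe rather than going through each group:

(df
 # keyword arguments are required from pandas 2.0 on; the existing index is kept as the row index
 .pivot(columns=['a_bins', 'b_bins', 'c_bins'], values='value')
 .mean()
 .sort_index() # ignore this if you are not fussy about order
)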

a_bins         b_bins         c_bins       
(-2.15, -2.1]  (0.25, 0.3]    (-1.3, -1.25]    0.929100
               (0.75, 0.8]    (-0.3, -0.25]    0.480411
(-2.05, -2.0]  (-0.1, -0.05]  (0.3, 0.35]     -1.684900
               (0.75, 0.8]    (-0.25, -0.2]   -1.184411
(-2.0, -1.95]  (-0.6, -0.55]  (-1.2, -1.15]   -0.021176
                                                 ...   
(1.7, 1.75]    (-0.75, -0.7]  (1.05, 1.1]     -0.229518
(1.85, 1.9]    (-0.4, -0.35]  (1.8, 1.85]      0.003017
(1.9, 1.95]    (-1.45, -1.4]  (0.1, 0.15]      0.949361
(2.05, 2.1]    (-0.35, -0.3]  (-0.65, -0.6]    0.763184
(2.25, 2.3]    (-0.95, -0.9]  (0.1, 0.15]      2.539432

This matches the output of groupby:

(df
 .groupby(['a_bins','b_bins','c_bins'])
 .agg({'value':['mean']})
 .dropna()
 .squeeze()
)

a_bins         b_bins         c_bins       
(-2.15, -2.1]  (0.25, 0.3]    (-1.3, -1.25]    0.929100
               (0.75, 0.8]    (-0.3, -0.25]    0.480411
(-2.05, -2.0]  (-0.1, -0.05]  (0.3, 0.35]     -1.684900
               (0.75, 0.8]    (-0.25, -0.2]   -1.184411
(-2.0, -1.95]  (-0.6, -0.55]  (-1.2, -1.15]   -0.021176
                                                 ...   
(1.7, 1.75]    (-0.75, -0.7]  (1.05, 1.1]     -0.229518
(1.85, 1.9]    (-0.4, -0.35]  (1.8, 1.85]      0.003017
(1.9, 1.95]    (-1.45, -1.4]  (0.1, 0.15]      0.949361
(2.05, 2.1]    (-0.35, -0.3]  (-0.65, -0.6]    0.763184
(2.25, 2.3]    (-0.95, -0.9]  (0.1, 0.15]      2.539432
Name: (value, mean), Length: 100, dtype: float64

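The pivot option ran in 3.72 ms on my PC, whereas I had to terminate the groupby option because it was taking too long (my PC is old :))

Again, the reason this works and is faster is that the mean hits the entire dataframe instead of going through the groups in the groupby.

As for your other question, you can index into it easily: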


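bin_mean = (df
 # keyword arguments are required from pandas 2.0 on; the existing index is kept as the row index
 .pivot(columns=['a_bins', 'b_bins', 'c_bins'], values='value')
 .mean()
 .sort_index() # ignore this if you are not fussy about order
)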

bin_mean.loc[(-1.72, 0.32, 1.18)]
 -0.25243603652138985

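The main issue, though, is that for categoricals pandas returns a row for every combination of categories, including the empty ones (which is wasteful and inefficient). Pass observed=True and you should notice a significant improvement: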

(df.groupby(['a_bins','b_bins','c_bins'], observed=True)
   .agg({'value':['mean']})
)

                                              value
                                               mean
a_bins        b_bins        c_bins                 
(-2.15, -2.1] (0.25, 0.3]   (-1.3, -1.25]  0.929100
              (0.75, 0.8]   (-0.3, -0.25]  0.480411
(-2.05, -2.0] (-0.1, -0.05] (0.3, 0.35]   -1.684900
              (0.75, 0.8]   (-0.25, -0.2] -1.184411
(-2.0, -1.95] (-0.6, -0.55] (-1.2, -1.15] -0.021176
...                                             ...
(1.7, 1.75]   (-0.75, -0.7] (1.05, 1.1]   -0.229518
(1.85, 1.9]   (-0.4, -0.35] (1.8, 1.85]    0.003017
(1.9, 1.95]   (-1.45, -1.4] (0.1, 0.15]    0.949361
(2.05, 2.1]   (-0.35, -0.3] (-0.65, -0.6]  0.763184
(2.25, 2.3]   (-0.95, -0.9] (0.1, 0.15]    2.539432

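This runs in about 7.39 ms on my PC, roughly 2x slower than the pivot option but far faster than before, because only the categories that actually occur in the dataframe are used and returned.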
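
The effect of observed is easy to see on a deliberately coarse grid, where the full groupby is still cheap. The sketch below uses 6 bins per column instead of the question's 139 (`coarse_bins` is an assumption made for illustration):

```python
import numpy as np
import pandas as pd

np.random.seed(100)
df = pd.DataFrame(np.random.randn(100, 4), columns=["a", "b", "c", "value"])

coarse_bins = np.arange(-3, 4, 1.0)   # 7 edges, so 6 bins per column
keys = [pd.cut(df[col], bins=coarse_bins) for col in "abc"]

# observed=False: one row per combination of categories, empty ones included
all_groups = df.groupby(keys, observed=False)["value"].size()

# observed=True: only combinations that actually occur in the data
observed_groups = df.groupby(keys, observed=True)["value"].size()

print(len(all_groups), len(observed_groups))
```

The first length grows with the cube of the number of bins, while the second can never exceed the number of rows.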

Answer 4

Another straightforward solution, based on convtools, which can process a stream of input data and does not require the input to fit in memory:

import numpy as np
import pandas as pd

from convtools import conversion as c


def c_bin(left, right, bin_size):
    return c.if_(
        c.or_(c.this < left, c.this > right),
        None,
        ((c.this - left) // bin_size).pipe(
            (c.this * bin_size + left, (c.this + 1) * bin_size + left)
        ),
    )


to_binned = c_bin(-3, 4, 0.05)
to_interval = c.if_(c.this, c.apply_func(pd.Interval, c.this, {}), None)

a_bins = c.item(0).pipe(to_binned)
b_bins = c.item(1).pipe(to_binned)
c_bins = c.item(2).pipe(to_binned)
converter = (
    c.group_by(a_bins, b_bins, c_bins)
    .aggregate(
        {
            "a_bins": a_bins.pipe(to_interval),
            "b_bins": b_bins.pipe(to_interval),
            "c_bins": c_bins.pipe(to_interval),
            "value_mean": c.ReduceFuncs.Average(c.item(3)),
        }
    )
    .gen_converter()
)


np.random.seed(100)
data = np.random.randn(100, 4)

df = pd.DataFrame(converter(data)).set_index(["a_bins", "b_bins", "c_bins"])
df.loc[(-1.72, 0.32, 1.18)]

Timings:

In [44]: %timeit converter(data)
438 µs ± 1.59 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

# passing back to pandas, timing the end-to-end thing:
In [43]: %timeit pd.DataFrame(converter(data)).set_index(["a_bins", "b_bins", "c_bins"]).loc[(-1.72, 0.32, 1.18)]
2.37 ms ± 14.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

JFYI: shortened output of converter(data):

[
 ...,
 {'a_bins': Interval(-0.44999999999999973, -0.3999999999999999, closed='right'),
  'b_bins': Interval(0.7000000000000002, 0.75, closed='right'),
  'c_bins': Interval(-0.19999999999999973, -0.1499999999999999, closed='right'),
  'value_mean': -0.08605564337254189},
 {'a_bins': Interval(-0.34999999999999964, -0.2999999999999998, closed='right'),
  'b_bins': Interval(-0.1499999999999999, -0.09999999999999964, closed='right'),
  'c_bins': Interval(0.050000000000000266, 0.10000000000000009, closed='right'),
  'value_mean': 0.18971879197958597},
 {'a_bins': Interval(-2.05, -2.0, closed='right'),
  'b_bins': Interval(0.75, 0.8000000000000003, closed='right'),
  'c_bins': Interval(-0.25, -0.19999999999999973, closed='right'),
  'value_mean': -1.1844114274105708}]
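
For readers who do not use convtools, the same streaming idea can be sketched in plain Python: keep a running sum and count per bin and consume the rows one at a time. This is not the convtools API; `streaming_bin_means` and its parameters are hypothetical, bins are keyed by integer position rather than by pd.Interval, and out-of-range rows are simply skipped:

```python
import math
from collections import defaultdict


def streaming_bin_means(rows, left=-3.0, right=4.0, bin_size=0.05):
    """Mean of `value` per (a, b, c) bin, consuming `rows` one at a time."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for a, b, c, value in rows:
        if any(x < left or x > right for x in (a, b, c)):
            continue  # out-of-range rows are skipped in this sketch
        # Integer bin position along each axis, counted from `left`.
        key = tuple(math.floor((x - left) / bin_size) for x in (a, b, c))
        sums[key] += value
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}


# Any iterable works, including a generator that reads rows from a file.
rows = iter([
    (0.01, 0.01, 0.01, 1.0),
    (0.02, 0.02, 0.02, 3.0),
    (1.01, 1.01, 1.01, 5.0),
])
means = streaming_bin_means(rows)
print(means)
```

Memory use depends on the number of occupied bins, not on the number of input rows.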
