Binned statistics with irregular and alternating bins

Question

This is a short, fully reproducible example of a more complex real-world application.

Libraries used:

import numpy as np
import scipy as sp
import scipy.stats as scist
import matplotlib.pyplot as plt
from itertools import zip_longest

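Data:

I have arrays that define irregular bins by their starts and ends, for example like this (in the real-world case this format is a given, because it is the output of another process):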

bin_starts = np.array([0, 93, 184, 277, 368])
bin_ends = np.array([89, 178, 272, 363, 458])

which I combine with:

# np.stack needs a sequence, not an iterator, so materialize the zip first
bns = np.stack(list(zip_longest(bin_starts, bin_ends))).flatten()
bns
>>> array([  0,  89,  93, 178, 184, 272, 277, 363, 368, 458])
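To make the alternating structure explicit, here is a minimal sketch that interleaves the same starts and ends and separates the widths of the long intervals from those of the short gaps. The names `edges`, `widths`, `long_widths` and `short_widths` are illustrative and not part of the original post; `np.column_stack` is used here only as one possible way to interleave the two arrays.

```python
import numpy as np

bin_starts = np.array([0, 93, 184, 277, 368])
bin_ends = np.array([89, 178, 272, 363, 458])

# Interleave starts and ends: start0, end0, start1, end1, ...
edges = np.column_stack((bin_starts, bin_ends)).ravel()

# Widths of the 9 consecutive intervals defined by the 10 edges
widths = np.diff(edges)

# Even positions run from start_i to end_i (the long intervals)
long_widths = widths[::2]
# Odd positions run from end_i to start_{i+1} (the short gaps)
short_widths = widths[1::2]

print(long_widths)   # [89 85 88 86 90]
print(short_widths)  # [4 6 5 5]
```

Because the first interval is a long one, the `[::2]` slice always selects the long intervals.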

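This gives a regularly alternating sequence of long and short intervals, all of irregular length. Here is a sketch of the long and short intervals:

I have a bunch of time series data, similar to the random data created below: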

# make some random example data to bin
np.random.seed(45)
x = np.arange(0,460)
y = 5+np.random.randn(460).cumsum()
plt.plot(x,y);

Objective:

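I want to use the sequence of intervals to collect statistics on the data (mean, percentiles, etc.), but using only the long intervals, i.e. the yellow ones in the sketch.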

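Assumptions and clarifications:

The edges of the long intervals never overlap; in other words, there is always a short interval between two long intervals. Also, the first interval is always a long one.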
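These assumptions can be checked before binning. The following is a small sketch, not part of the original post; the variable names `valid_lengths` and `no_overlap` are made up for illustration.

```python
import numpy as np

bin_starts = np.array([0, 93, 184, 277, 368])
bin_ends = np.array([89, 178, 272, 363, 458])

# Every long interval must have positive length
valid_lengths = bool(np.all(bin_ends > bin_starts))

# Every long interval must end strictly before the next one starts,
# which guarantees a short gap between consecutive long intervals
no_overlap = bool(np.all(bin_starts[1:] > bin_ends[:-1]))

print(valid_lengths, no_overlap)  # True True
```

If either check fails, the interleaved edge array would not be strictly increasing and could not be used as bin edges.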

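Current solution:

One approach is to use scipy.stats.binned_statistic on all intervals and then slice the result to keep only every other one (i.e. [::2]), like this (for some statistics, such as np.percentile, reading an answer by @ali_m was a great help):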

ave = scist.binned_statistic(x, y, 
                         statistic = np.nanmean, 
                         bins=bns)[0][::2]
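The same pattern extends to percentiles, since `binned_statistic` accepts any callable as `statistic`. The sketch below assumes a hypothetical helper `p90` and regenerates similar random data with `np.random.default_rng`, so the numbers will differ from those in the question. Note that `binned_statistic` treats every bin except the last as half-open, so the first long bin covers x = 0..88 and excludes 89.

```python
import numpy as np
from scipy import stats

bin_starts = np.array([0, 93, 184, 277, 368])
bin_ends = np.array([89, 178, 272, 363, 458])
edges = np.column_stack((bin_starts, bin_ends)).ravel()

# Random-walk example data (illustrative, not the question's exact data)
rng = np.random.default_rng(45)
x = np.arange(460)
y = 5 + rng.standard_normal(460).cumsum()


def p90(values):
    """Hypothetical helper: 90th percentile of the values in one bin."""
    return np.percentile(values, 90)


# Compute the statistic for all 9 bins, then keep only the long ones
result = stats.binned_statistic(x, y, statistic=p90, bins=edges)
p90_long = result.statistic[::2]
print(p90_long)
```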

This gives me the desired result:

plt.plot(np.arange(0,5), ave);

Question: Is there a more Pythonic way to do this (using any of NumPy, SciPy or pandas)?

Answer 1

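I think some combination of IntervalIndex, pd.cut, groupby and agg is a relatively straightforward and easy way to get what you want.

I would first make a DataFrame (not sure if this is the best way to get there from np arrays):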

import pandas as pd

df = pd.DataFrame()
df['x'], df['y'] = x, y

Then you can define your bins as a list of tuples:

bins = list(zip(bin_starts, bin_ends))

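Create the bins with the pandas IntervalIndex, which has a from_tuples() method, for later use in cut. This is useful because you don't have to rely on slicing the bns array to untangle the "regularly alternating sequence of long and short intervals"; instead, you explicitly define the bins you are interested in: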

ii = pd.IntervalIndex.from_tuples(bins, closed='both')

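The closed kwarg specifies whether the endpoints are included in the interval. For example, for the tuple (0, 89), an interval with closed='both' includes both 0 and 89 (as opposed to left, right, or neither).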
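As a quick illustration of what `closed` changes, this sketch builds single `pd.Interval` objects for the tuple (0, 89) and tests endpoint membership. The variable names are illustrative only.

```python
import pandas as pd

both = pd.Interval(0, 89, closed='both')
left = pd.Interval(0, 89, closed='left')
right = pd.Interval(0, 89, closed='right')
neither = pd.Interval(0, 89, closed='neither')

# Membership of the two endpoints under each setting
print(0 in both, 89 in both)        # True True
print(0 in left, 89 in left)        # True False
print(0 in right, 89 in right)      # False True
print(0 in neither, 89 in neither)  # False False
```

If `closed` is omitted, pandas uses `'right'`, which would drop the sample at x = 0 from the first bin.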

Then use pd.cut(), which bins values into intervals, to create a category column in the dataframe. The IntervalIndex object can be passed via the bins kwarg:

df['bin'] = pd.cut(df.x, bins=ii)
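To see what happens to values that fall in the short gaps, here is a small sketch with two of the bins and a handful of hand-picked sample values (the `sample` series is made up for illustration). Values not covered by any interval are labelled NaN by `pd.cut`, and `groupby` drops NaN keys by default, which is why the short intervals never show up in the aggregated result.

```python
import pandas as pd

ii = pd.IntervalIndex.from_tuples([(0, 89), (93, 178)], closed='both')

# 90 and 92 lie in the gap between the bins; 179 lies past the last bin
sample = pd.Series([0, 89, 90, 92, 93, 178, 179])
labels = pd.cut(sample, bins=ii)

print(labels.isna().tolist())
# [False, False, True, True, False, False, True]
```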

Finally, use df.groupby() and .agg() to get whatever statistics you want:

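# 'std' is the sample standard deviation; observed=True keeps only bins that contain data
df.groupby('bin', observed=True)['y'].agg(['mean', 'std'])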

Output:

                 mean       std
bin                            
[0, 89]     -4.814449  3.915259
[93, 178]   -7.019151  3.912347
[184, 272]   7.223992  5.957779
[277, 363]  15.060402  3.979746
[368, 458]  -0.644127  3.361927
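For percentiles and other statistics, the same groupby can take named aggregations. The following end-to-end sketch assumes similar random data generated with `np.random.default_rng` (so the numbers will not match the output above) and uses `IntervalIndex.from_arrays` as an alternative to building the list of tuples first; the column names `mean`, `median` and `p90` are arbitrary.

```python
import numpy as np
import pandas as pd

bin_starts = np.array([0, 93, 184, 277, 368])
bin_ends = np.array([89, 178, 272, 363, 458])

# Random-walk example data (illustrative, not the question's exact data)
rng = np.random.default_rng(45)
x = np.arange(460)
y = 5 + rng.standard_normal(460).cumsum()
df = pd.DataFrame({'x': x, 'y': y})

# Build the long intervals directly from the start and end arrays
ii = pd.IntervalIndex.from_arrays(bin_starts, bin_ends, closed='both')
df['bin'] = pd.cut(df['x'], bins=ii)

# One row per long interval; rows in the gaps have NaN bins and are dropped
summary = df.groupby('bin', observed=True)['y'].agg(
    mean='mean',
    median='median',
    p90=lambda s: s.quantile(0.90),
)
print(summary)
```

With `closed='both'` each bin includes its right endpoint, so the results can differ slightly from `scipy.stats.binned_statistic`, which treats all bins except the last as half-open.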
