Bin variable with pre-defined bins and closed/open intervals

Question

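I have a set of bins that can be defined by:

A set of tuples giving the non-overlapping boundaries of each bin:

Intervals: [(0,1), (1,2), (3,4)]
A set of indicators identifying which boundaries of each tuple are closed:

Closed_Boundaries: ['right','right','both']
A set of labels, one per interval:

Labels: ['first', 'second', 'third']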

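I am looking for an efficient, elegant and scalable way to apply this binning to a numeric series in a pandas DataFrame, so that the result contains the label identified by the binning logic for each value: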

Data_input: [3.5, 1, 0.5, 3]

Data_result: ['third', 'first', 'first', 'third']

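I tried using pandas.IntervalIndex.from_tuples() followed by pandas.cut(). However, the labels argument of pandas.cut() is disabled when the bins come from IntervalIndex.from_tuples(), and the name argument of the latter does not let me set labels to use as replacement values.

PS: The pandas issue about IntervalIndex not supporting labels is discussed here.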

Answer 1

If all intervals are closed on the same side

The simplest approach is to use pd.cut. However, there is an outstanding bug where labels is ignored when bins is an IntervalIndex.

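import pandas as pd

def cut(array, bins, labels, closed='right'):
    # Build an IntervalIndex whose intervals all share the same closed side.
    _bins = pd.IntervalIndex.from_tuples(bins, closed=closed)

    x = pd.cut(array, _bins)
    # Workaround for the bug: pd.cut ignores `labels` here, so rename the
    # interval categories to the labels (same order as `bins`).
    return x.rename_categories(labels)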

array = [3.5, 1, 0.5, 3]
bins = [(0,1), (1,2), (3,4)]
labels = ['first', 'second', 'third']

df = pd.DataFrame({
    'value': array,
    'category': cut(array, bins, labels, closed='right')
})

Output:

   value category
0    3.5    third
1    1.0    first
2    0.5    first
3    3.0      NaN
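As an alternative to renaming categories, the same lookup can be sketched without pd.cut by asking the IntervalIndex directly which interval contains each value. This is only a minimal sketch under the same assumption that all intervals are closed on the same side and do not overlap; the function name label_by_interval is made up for illustration and is not part of pandas.

```python
import numpy as np
import pandas as pd

def label_by_interval(values, bins, labels, closed="right"):
    """Hypothetical helper: map each value to the label of its interval."""
    index = pd.IntervalIndex.from_tuples(bins, closed=closed)
    # get_indexer returns the position of the containing interval,
    # or -1 when no interval contains the value.
    positions = index.get_indexer(values)
    labels_arr = np.asarray(labels, dtype=object)
    # Look up labels by position, then blank out the unmatched values.
    return np.where(positions >= 0, labels_arr[positions], None)

result = label_by_interval(
    [3.5, 1, 0.5, 3],
    bins=[(0, 1), (1, 2), (3, 4)],
    labels=["first", "second", "third"],
)
print(list(result))
```

Values that fall in no interval come back as None instead of NaN, which may or may not suit the column they are assigned to.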

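If each interval is closed differently

Things get slower because the code is not vectorized, but it is conceptually simple: for each item in the array, find the first bin it falls into and append that bin's label.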

def cut(array, bins, labels):
    # Each bin is a (left, right, closed) tuple.
    intervals = [pd.Interval(*b) for b in bins]

    categories = []
    for value in array:
        cat = None  # stays None when no bin contains the value
        for i, interval in enumerate(intervals):
            if value in interval:
                cat = labels[i]
                break
        categories.append(cat)

    return categories

cut([3.5, 1, 0.5, 3], bins=[(0,1,'right'),(1,2,'right'),(3,4,'left')], labels=['first', 'second', 'third'])

I modified the bin tuples to include the side on which they are closed. The options are left, right, both and neither.
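If the per-value loop becomes a bottleneck, one possible variation is to loop over the bins instead and compare the whole array against each bin at once with NumPy. The sketch below assumes the same (left, right, closed) tuple format described above; cut_mixed is a hypothetical name, and the first matching bin wins, as in the loop version.

```python
import numpy as np

def cut_mixed(values, bins, labels):
    """Hypothetical helper: label values using bins with per-bin closed sides."""
    values = np.asarray(values, dtype=float)
    result = np.full(values.shape, None, dtype=object)
    assigned = np.zeros(values.shape, dtype=bool)
    for (left, right, closed), label in zip(bins, labels):
        # Pick inclusive or exclusive comparisons depending on the closed side.
        lower = values >= left if closed in ("left", "both") else values > left
        upper = values <= right if closed in ("right", "both") else values < right
        mask = lower & upper & ~assigned  # keep the first matching bin only
        result[mask] = label
        assigned |= mask
    return result

binned = cut_mixed(
    [3.5, 1, 0.5, 3],
    bins=[(0, 1, "right"), (1, 2, "right"), (3, 4, "both")],
    labels=["first", "second", "third"],
)
print(list(binned))  # ['third', 'first', 'first', 'third']
```

The work here scales with the number of bins rather than the number of values, so it should suit long series with a modest number of bins. Values outside every bin are left as None.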
