How to efficiently find overlapping intervals?

I have the following toy example DataFrame, df:

      f_low    f_high
   0.476201  0.481915
   0.479161  0.484977
   0.485997  0.491911
   0.503259  0.508679
   0.504687  0.510075
   0.504687  0.670075
   0.666093  0.670438
   0.765602  0.770028
   0.766884  0.771307
   0.775986  0.780398
   0.794590  0.798965

To find its overlapping subsets, I use the following code:

df = df.sort_values('f_low')
for row in df.itertuples():
    iix = pd.IntervalIndex.from_arrays(df.f_low, df.f_high, closed='neither')
    span_range = pd.Interval(row.f_low, row.f_high)
    fx = df[(iix.overlaps(span_range))].copy()

I expect to get overlapping DataFrames like this:

   # iteration 1: over row.f_low=0.476201  row.f_high=0.481915 

      f_low    f_high
   0.476201  0.481915
   0.479161  0.484977

   # iteration 2: over row.f_low=0.503259  row.f_high=0.508679 
      f_low    f_high
   0.503259  0.508679 
   0.504687  0.510075
   0.504687  0.670075

   # iteration 3: over row.f_low=0.504687  row.f_high=0.670075 
      f_low    f_high
   0.666093  0.670438

and so on.

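This works fine, but because the DataFrame is large and contains many overlaps, it takes a long time to process. In addition, the interval I am testing for overlaps does not pick up itself when I use the pandas Interval overlaps method.

This is meant to represent a series of overlapping confidence intervals, with each row being iterated over.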

Is there a more efficient way to extract the intervals that overlap a given interval than iterating over all the tuples?

Here is an unsorted chunk of the actual DataFrame:

f_low   f_high
0.504687  0.670075
0.476201  0.481915
0.765602  0.770028
0.479161  0.484977
0.766884  0.771307
0.485997  0.491911
0.666093  0.670438
0.503259  0.508679
0.775986  0.780398
0.504687  0.510075
0.794590  0.798965

Does this help?

import itertools
import pandas as pd

intervals = df.apply(lambda row: pd.Interval(row['f_low'], row['f_high']), axis=1)
overlaps = [
    (i, j, x, y, x.overlaps(y)) 
    for ((i,x),(j,y))
    in itertools.product(enumerate(intervals), repeat=2)
]

>>> overlaps[:3]
[(0,
  0,
  Interval(0.47620100000000004, 0.481915, closed='right'),
  Interval(0.47620100000000004, 0.481915, closed='right'),
  True),
 (0,
  1,
  Interval(0.47620100000000004, 0.481915, closed='right'),
  Interval(0.47916099999999995, 0.48497700000000005, closed='right'),
  True),
 (0,
  2,
  Interval(0.47620100000000004, 0.481915, closed='right'),
  Interval(0.485997, 0.491911, closed='right'),
  False)]

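From here you can get the numeric indices in the original DataFrame. I'm not sure how well it performs, but it should be better than what you have now.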
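
To make that last step concrete, here is a minimal sketch of pulling the index pairs out of a list shaped like overlaps. It rebuilds the list on a three-row subset of the sorted example data. The small df and the name pairs are assumptions for illustration, not part of the answer above.

```python
import itertools

import pandas as pd

# Three rows from the sorted example data (illustrative subset).
df = pd.DataFrame({
    "f_low": [0.476201, 0.479161, 0.485997],
    "f_high": [0.481915, 0.484977, 0.491911],
})

intervals = [pd.Interval(lo, hi) for lo, hi in zip(df["f_low"], df["f_high"])]

# Same shape as the list above: (i, j, interval_i, interval_j, overlaps?)
overlaps = [
    (i, j, x, y, x.overlaps(y))
    for (i, x), (j, y) in itertools.product(enumerate(intervals), repeat=2)
]

# Keep each overlapping pair once (i < j), which also drops self-matches.
pairs = [(i, j) for i, j, _, _, hit in overlaps if hit and i < j]
print(pairs)
```

Because the product visits every ordered pair, the work still grows quadratically with the number of rows.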

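If I understand correctly, you want to split your current df into DataFrames, where the first interval is set by the first row, the second interval is defined by the first row that does not intersect it, and so on. The approach below does that, and should be quite efficient as long as the number of groups is not too large: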

df = df.sort_values("f_low").reset_index(drop=True)
idx = 0
dfs = []
while True:
    low = df.f_low[idx]
    high = df.f_high[idx]
    sub_df = df[(df.f_low <= high) & (low <= df.f_low)]
    dfs.append(sub_df)
    idx = sub_df.index.max() + 1
    if idx > df.index.max():
        break

Output:

[      f_low    f_high
 0  0.476201  0.481915
 1  0.479161  0.484977,
       f_low    f_high
 2  0.485997  0.491911,
       f_low    f_high
 3  0.503259  0.508679
 4  0.504687  0.510075
 5  0.504687  0.670075,
       f_low    f_high
 6  0.666093  0.670438,
       f_low    f_high
 7  0.765602  0.770028
 8  0.766884  0.771307,
       f_low    f_high
 9  0.775986  0.780398,
       f_low    f_high
 10  0.79459  0.798965]

Using NumPy array broadcasting:

import numpy as np

l1 = df['f_low'].to_numpy()
h1 = df['f_high'].to_numpy()

l2 = l1[:, None]
h2 = h1[:, None]

# Check for overlap
# mask is an n * n matrix indicating if interval i overlaps with interval j
mask = (l1 < h2) & (h1 > l2)

# If interval i overlaps interval j then j also overlaps i. We only want to get
# one of the two pairs. Hence the `triu` (triangle, upper)
# Every interval also overlaps itself and we don't want that either. Hence the k=1
overlaps = np.triu(mask, k=1).nonzero()

The result in overlaps needs some explanation:

(array([0, 3, 3, 4, 5, 7]),
 array([1, 4, 5, 5, 6, 8]))

# Row 0 overlaps with row 1
# Row 3 overlaps with row 4
# Row 3 overlaps with row 5
# ....
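
As a hedged sketch of how the two index arrays might be used afterwards, the snippet below turns them into a table of pairs and into the subset of rows that take part in any overlap. It uses a four-row subset of the sorted data. The names pair_table and overlapping_rows are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Four rows from the sorted example data (illustrative subset).
df = pd.DataFrame({
    "f_low": [0.476201, 0.479161, 0.485997, 0.503259],
    "f_high": [0.481915, 0.484977, 0.491911, 0.508679],
})

l1 = df["f_low"].to_numpy()
h1 = df["f_high"].to_numpy()

# n x n overlap matrix, then the upper triangle without the diagonal.
mask = (l1 < h1[:, None]) & (h1 > l1[:, None])
i, j = np.triu(mask, k=1).nonzero()

# One row per overlapping pair, labelled with the DataFrame's own index.
pair_table = pd.DataFrame({"left": df.index[i], "right": df.index[j]})

# Rows that take part in at least one overlap (i and j are positions).
overlapping_rows = df.iloc[np.union1d(i, j)]
print(pair_table)
print(overlapping_rows)
```

Note that the mask holds n * n booleans, so memory use grows quadratically with the number of rows.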

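Consecutive overlaps

Treat the "f_low" values as entry points and assign them a value of 1. Treat the "f_high" values as exit points and assign them a value of -1. If we go through all the values in ascending order and accumulate the assigned values, then we are inside an overlapping stretch whenever the accumulated value is greater than zero. When the accumulated value reaches zero, we know we have exited all overlapping intervals.

Note:

This groups together all intervals that overlap consecutively. If an interval does not overlap the first one, but does overlap the last one in the chain, it still counts as overlapping.

I'll provide a similar solution for the other option below this one.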


Worked example

#  1     3                     (Interval from 1 to 3)
#     2        5               (Interval from 2 to 5)
#                    7     9   (Interval from 7 to 9)

#  1  1 -1    -1     1    -1   (Entry/Exit values)
#  1  2  1     0     1     0   (Accumulated values)
#              ⇑           ⇑
# zero indicates leaving all overlaps

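This shows that once we enter the interval from 1 to 3, we do not leave all overlapping intervals until we reach 5, the right side of the interval from 2 to 5, as indicated by the accumulated value reaching zero.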
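
The accumulation for the three example intervals (1 to 3, 2 to 5, 7 to 9) can be reproduced with a few lines of NumPy. This is a minimal sketch of the entry/exit idea only. The names lows, highs, points, steps and accumulated are assumptions for illustration.

```python
import numpy as np

lows = np.array([1, 2, 7])
highs = np.array([3, 5, 9])

# All endpoints, with +1 for entering an interval and -1 for leaving one.
points = np.concatenate([lows, highs])
steps = np.concatenate([np.ones_like(lows), -np.ones_like(highs)])

# Walk the endpoints in ascending order and accumulate the steps.
order = np.argsort(points, kind="stable")
accumulated = steps[order].cumsum()

print(points[order])  # endpoints in ascending order
print(accumulated)    # a zero marks the end of a chain of overlaps
```

The zeros fall at endpoints 5 and 9, which are the two places where the diagram shows all overlaps being left.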


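I'll use a generator to return lists of the indices of the original DataFrame that have overlapping intervals.

When all is said and done, this should be N * log(N) because of the sorting involved.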

def gen_overlaps(df):
    df = df.sort_values('f_low')
    
    # get sorter lows and highs
    a = df.to_numpy().ravel().argsort()
    
    # get free un-sorter
    b = np.empty_like(a)
    b[a] = np.arange(len(a))
    
    # get ones and negative ones
    # to indicate entering into
    # and exiting an interval
    c = np.ones(df.shape, int) * [1, -1]
    
    # if we sort by all values and
    # accumulate when we enter and exit
    # the accumulated value should be 
    # zero when there are no overlaps
    d = c.ravel()[a].cumsum()[b].reshape(df.shape)
    #             ⇑           ⇑
    # sort by value order     unsort to get back to original order
    
    indices = []
    for i, indicator in zip(df.index, d[:, 1] == 0):
        indices.append(i)
        if indicator:
            yield indices
            indices = []
    if indices:
        yield indices
    

Then I'll use pd.concat to organize them and show what I mean. k is the kth group. Some groups contain only one interval.

pd.concat({
    k: df.loc[i] for k, i in
    enumerate(gen_overlaps(df))
})

         f_low    f_high
0 0   0.476201  0.481915
  1   0.479161  0.484977
1 2   0.485997  0.491911
2 3   0.503259  0.508679
  4   0.504687  0.510075
  5   0.504687  0.670075
  6   0.666093  0.670438
3 7   0.765602  0.770028
  8   0.766884  0.771307
4 9   0.775986  0.780398
5 10  0.794590  0.798965

If we only want the ones that overlap...

pd.concat({
    k: df.loc[i] for k, i in
    enumerate(gen_overlaps(df))
    if len(i) > 1
})

        f_low    f_high
0 0  0.476201  0.481915
  1  0.479161  0.484977
2 3  0.503259  0.508679
  4  0.504687  0.510075
  5  0.504687  0.670075
  6  0.666093  0.670438
3 7  0.765602  0.770028
  8  0.766884  0.771307

Overlapping only the next interval in line

This is a simpler solution that matches the OP's desired output.

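def gen_overlaps(df):
    df = df.sort_values('f_low')

    indices = []
    cursor = None
    for i, low, high in df.itertuples():
        if indices and low > cursor:
            # this row starts past the end of the row that opened the group:
            # emit the group if it really overlaps, then start a new one
            if len(indices) > 1:
                yield indices
            indices = []
        if not indices:
            # the first row of a group sets the upper bound for that group
            cursor = high
        indices.append(i)
    if len(indices) > 1:
        yield indices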
    

pd.concat({
    k: df.loc[i] for k, i in
    enumerate(gen_overlaps(df))
})

        f_low    f_high
0 0  0.476201  0.481915
  1  0.479161  0.484977
1 3  0.503259  0.508679
  4  0.504687  0.510075
  5  0.504687  0.670075
2 7  0.765602  0.770028
  8  0.766884  0.771307

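I'm not sure what kind of overlap you need, but I think this approach could work:

  • Make sure your mask is adequate.
  • Create a dictionary whose key, for each iteration, is (f_low, f_high).
  • Filter the original DataFrame.
  • As you said, the real use case is a large dataset, so query should perform better than .loc.
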
import pandas as pd
df = pd.DataFrame(
    [
        [0.504687, 0.670075],
        [0.476201, 0.481915],
        [0.765602, 0.770028],
        [0.479161, 0.484977],
        [0.766884, 0.771307],
        [0.485997, 0.491911],
        [0.666093, 0.670438],
        [0.503259, 0.508679],
        [0.775986, 0.780398],
        [0.504687, 0.510075],
        [0.794590, 0.798965]
    ],
    columns=["f_low", "f_high"]
)
overlap = {
    (row.f_low, row.f_high): df.query("(@row.f_low <= f_low <= @row.f_high) or (@row.f_low <= f_high <= @row.f_high)")
    for row in df.itertuples()
}
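
On the first bullet: the mask above matches rows whose endpoints fall inside the probe interval, so it would miss a row that strictly contains the probe. As a hedged alternative (not the answerer's code), the usual closed-interval overlap test is "f_low <= probe_high and f_high >= probe_low". The sketch below compares the two on made-up data. The names probe_low, probe_high, endpoint_hits and general_hits are assumptions for illustration.

```python
import pandas as pd

# Made-up data: row 0 strictly contains row 1, row 2 is disjoint.
df = pd.DataFrame(
    [[0.50, 0.70], [0.60, 0.61], [0.80, 0.90]],
    columns=["f_low", "f_high"],
)

probe_low, probe_high = 0.60, 0.61

# Endpoint-based mask, same shape as the query used above.
endpoint_hits = df.query(
    "(@probe_low <= f_low <= @probe_high) or (@probe_low <= f_high <= @probe_high)"
)

# General overlap test: each interval starts before the other one ends.
general_hits = df.query("f_low <= @probe_high and f_high >= @probe_low")

print(endpoint_hits.index.tolist())
print(general_hits.index.tolist())
```

Only the general test reports row 0, the interval that contains the probe.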