Efficient way of merging ranges (intervals) within a given threshold

Question

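I'd like to know whether there is an efficient way to compute the distance between consecutive ranges and merge the ones that fall within a given distance of each other. For example, given the ranges and distance d=10: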

The first iteration would be: (4-2) -> 2 -> 2 < 10 -> OK -> (1,7)

(12-7) -> 5 -> 5 < 10 -> OK -> (1,15)

(32-15) -> 17 -> 17 < 10 -> KO

(38-36) -> 2 -> 2 < 10 -> OK -> (32,41)

Desired (resulting) dataset:

1   15
32  41
...

If this algorithm (lists, tuples, loops) is not implemented efficiently, its cost could become a risk for the main program.

Thanks in advance!!

Answer 1

Source DF:

In [27]: df
Out[27]:
   start  end
0      1    2
1      4    7
2     12   15
3     32   36
4     38   41

In [28]: threshold = 10

Vectorized solution:

In [31]: (df.groupby(df['start'].sub(df['end'].shift()).ge(threshold).cumsum())
    ...:    .agg({'start':'first','end':'last'}))
    ...:
Out[31]:
   start  end
0      1   15
1     32   41
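
For anyone who wants to run this outside an IPython session, here is a self-contained sketch of the same groupby approach. The function name `merge_ranges` is my own (it is not a pandas function), and the sketch assumes the rows are already sorted by `start` and that the ranges do not overlap or nest, so the last `end` in each group is also the largest one.

```python
import pandas as pd


def merge_ranges(df, threshold):
    """Merge consecutive ranges whose gap is below `threshold`.

    Hypothetical helper; assumes `df` is sorted by 'start' and has
    'start' and 'end' columns with non-overlapping ranges.
    """
    # Gap between each start and the previous row's end (NaN for the first row).
    gap = df['start'].sub(df['end'].shift())
    # A new group begins wherever the gap reaches the threshold;
    # the cumulative sum turns those markers into group labels.
    group_id = gap.ge(threshold).cumsum()
    return (df.groupby(group_id)
              .agg({'start': 'first', 'end': 'last'})
              .reset_index(drop=True))


df = pd.DataFrame({'start': [1, 4, 12, 32, 38],
                   'end':   [2, 7, 15, 36, 41]})
merged = merge_ranges(df, 10)
print(merged)
```

If the input may be unsorted, calling `df.sort_values('start')` before passing it in should be enough for this sketch to apply.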

Explanation:

In [32]: df['start'].sub(df['end'].shift())
Out[32]:
0     NaN
1     2.0
2     5.0
3    17.0
4     2.0
dtype: float64

In [33]: df['start'].sub(df['end'].shift()).ge(threshold)
Out[33]:
0    False
1    False
2    False
3     True
4    False
dtype: bool

In [34]: df['start'].sub(df['end'].shift()).ge(threshold).cumsum()
Out[34]:
0    0
1    0
2    0
3    1
4    1
dtype: int32
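
The same grouping logic can also be written as a single pass in plain Python, which mirrors the iteration described in the question. This is only a sketch for comparison: `merge_ranges_loop` is a hypothetical name, and the input is assumed to be a list of `(start, end)` tuples already sorted by start.

```python
def merge_ranges_loop(ranges, threshold):
    """Single-pass merge of sorted (start, end) tuples.

    Hypothetical helper; assumes `ranges` is sorted by start.
    """
    merged = []
    for start, end in ranges:
        if merged and start - merged[-1][1] < threshold:
            # Gap to the last merged range is below the threshold: extend it.
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            # First range, or the gap is too large: start a new merged range.
            merged.append((start, end))
    return merged


ranges = [(1, 2), (4, 7), (12, 15), (32, 36), (38, 41)]
result = merge_ranges_loop(ranges, 10)
print(result)  # [(1, 15), (32, 41)]
```

The loop visits each range once, so it is linear in the number of ranges. For large DataFrames the vectorized version is likely to be faster, but that depends on the data and is worth timing.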

