Efficient solution for dataframe range filtering based on the ranges of another dataframe

I tried the code below to find the ranges of one dataframe that do not fall within the ranges of another dataframe. However, it takes more than a day to process a large file, because the last two for loops compare every row against every row. Each of my 24 dataframes has roughly 10^8 rows. Is there an efficient alternative to the following approach?

Please refer to this thread for a better understanding of my I/O:

My approach:
I first created tuple pairs from (df1['first.start'], df1['first.end']) and (df2['first.start'], df2['first.end']) so that I could apply the range() function. After that, I set a condition checking whether each df1 range falls within any df2 range. The edge case here is df1['first.start'] == df1['first.end'] (see the small illustration after the code below). I collect the filtered indices from the iteration and then pass them to df1.

# Flatten the start/end columns of df2 (here a subset, temp_df2) and df1
# into flat lists of alternating start, end values
df2_lst = []
for i, j in zip(temp_df2['first.start'], temp_df2['first.end']):
    df2_lst.append(i)
    df2_lst.append(j)
df1_lst = []
for i, j in zip(df1['first.start'], df1['first.end']):
    df1_lst.append(i)
    df1_lst.append(j)

def range_subset(range1, range2):
    """Whether range1 is a subset of range2."""
    if not range1:
        return True  # empty range is a subset of anything
    if not range2:
        return False  # non-empty range can't be a subset of empty range
    if len(range1) > 1 and range1.step % range2.step:
        return False  # must have a single value or integer multiple step
    return range1.start in range2 and range1[-1] in range2

##### FUNCTION FOR CREATING CHUNKS OF LISTS ####
def chunks(lst, n):
    """Yield successive n-sized chunks from lst (used with n=2 to pair up start/end)."""
    for i in range(0, len(lst), n):
        yield tuple(lst[i:i + n])

df1_lst2 = list(chunks(df1_lst, 2))
df2_lst2 = list(chunks(df2_lst, 2))
indices = []
for idx, (x, y) in enumerate(df1_lst2):  # main list
    for m, n in df2_lst2:  # filter list
        # checking if the main range falls inside any filter range
        if x != y and range_subset(range(x, y), range(m, n)):
            indices.append(idx)  # collecting the filtered index
            break  # match found; avoid appending the same index twice

df1.iloc[indices]
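
A small illustration of the edge case mentioned above: when start equals end, range(x, x) is empty, and range_subset() treats an empty range as a subset of anything, so the x != y guard is required.

print(list(range(5, 5)))                        # [] -- an empty range
print(range_subset(range(5, 5), range(0, 3)))   # True, despite no overlap
print(range_subset(range(5, 8), range(3, 10)))  # True: 5..7 lies inside 3..9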

If n and m are the numbers of rows in df1 and df2, any algorithm needs at least n * m comparisons to check every range in df1 against every range in df2. The problems with the code you posted are that (a) it has too many intermediate steps and (b) it uses slow Python loops. If you switch to numpy broadcasting, which uses highly optimized C loops under the hood, it will be a lot faster.
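
As a quick illustration of the broadcasting mechanics (toy arrays, not your data): lifting one operand to a column vector with [:, None] makes a single vectorized comparison produce the entire pairwise matrix.

import numpy as np

starts = np.array([3, 10, 20])[:, None]  # shape (3, 1): column vector
points = np.array([5, 12, 25, 1])        # shape (4,)

# One comparison yields all 3 x 4 pairwise results at C speed
print(starts <= points)
# [[ True  True  True False]
#  [False  True  True False]
#  [False False  True False]]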

The downside of numpy broadcasting is memory: it creates comparison matrices of n * m bytes, and at the size of your problem this may run your computer out of memory. We can mitigate that by processing df1 in chunks, trading some performance for a lower memory footprint.
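
As a back-of-the-envelope check, using the sample sizes from the code below (boolean arrays take one byte per element):

n, m = 10_000_000, 1000
print(n * m / 1e9)  # ~10 GB for each (m, n) boolean intermediate;
                    # the chained comparison below materializes several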

# Sample data
import numpy as np
import pandas as pd

def random_dataframe(size):
    # cumsum of positive integers guarantees start <= end in each pair
    a = np.random.randint(1, 100, 2 * size).cumsum()
    return pd.DataFrame({
        'first.start': a[::2],
        'first.end': a[1::2]
    })

n, m = 10_000_000, 1000
np.random.seed(42)
df1 = random_dataframe(n)
df2 = random_dataframe(m)

# ---------------------------

# Prepare the start and end times of df2 for comparison.
# [:, None] raises each array by one dimension, which is necessary
# for array broadcasting
s2 = df2['first.start'].to_numpy()[:, None]
e2 = df2['first.end'].to_numpy()[:, None]

# A chunk_size that is too small or too big will lower performance.
# Experiment to find a sweet spot
chunk_size = 100_000
offset = 0
mask = []

while offset < len(df1):
    s1 = df1['first.start'].to_numpy()[offset:offset+chunk_size]
    e1 = df1['first.end'].to_numpy()[offset:offset+chunk_size]
    mask.append(
        ((s2 <= s1) & (s1 <= e2) & (s2 <= e1) & (e1 <= e2)).any(axis=0)
    )
    offset += chunk_size

mask = np.hstack(mask)

# ---------------------------

# If memory is not a concern, use the following code. However, this
# may run slower than the chunking approach due to increased size of
# the array broadcasting operation. Profile your code to find out.
s2 = df2['first.start'].to_numpy()[:, None]
e2 = df2['first.end'].to_numpy()[:, None]

s1 = df1['first.start'].to_numpy()
e1 = df1['first.end'].to_numpy()

mask = ((s2 <= s1) & (s1 <= e2) & (s2 <= e1) & (e1 <= e2)).any(axis=0)

The chunked code took 30 seconds on my computer. To access the results:

df1[mask]  # ranges in df1 that are completely surrounded by a range in df2
df1[~mask] # ranges in df1 that are NOT completely surrounded by any range in df2

By tweaking the comparisons, you can also check for overlapping ranges instead.
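
For example, two closed intervals [s1, e1] and [s2, e2] overlap iff each starts before the other ends. A sketch reusing s1, e1, s2 and e2 from the non-chunked version above:

# Overlap instead of containment
overlap_mask = ((s2 <= e1) & (s1 <= e2)).any(axis=0)

df1[overlap_mask]   # ranges in df1 that overlap at least one range in df2
df1[~overlap_mask]  # ranges in df1 that overlap no range in df2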