drop_duplicates in a range

Question

I have a dataframe like this in Python:

    st      se      st_min  st_max  se_min  se_max 
42  922444  923190  922434  922454  923180  923200
24  922445  923190  922435  922455  923180  923200
43  928718  929456  928708  928728  929446  929466
37  928718  929459  928708  928728  929449  929469

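As we can see, the first 2 columns hold a range, and the remaining columns are that initial range shifted by 10 positions in each direction.

I know that the function drop_duplicates can remove duplicate rows based on an exact match of values.

But what if I want to remove rows based on a range of values? For example, indexes 42 and 24 fall in the same range (if I allow a range of 10), and indexes 43 and 37 are in the same situation.

How can I do this?

PS: I can't remove the redundancy based on just one column (e.g. st or se). I need to remove it based on both columns (st and se), using the ranges given by the min and max columns as the filter...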

Answer 1

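Note that, for example, the st values for keys 42 and 24 are different, so you cannot use the st value alone.

If, for example, your range can be defined as st / 100 (rounded down to an integer), you can create a column with this value: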

df['rng'] = df.st.floordiv(100)

Then use drop_duplicates with subset set to just this column, and drop the rng column afterwards:

df.drop_duplicates(subset='rng').drop(columns=['rng'])
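A minimal runnable sketch of this binning on the sample rows from the question. The bin width of 100 is the example value used above, and the name deduped is only illustrative:

```python
import pandas as pd

# Sample rows from the question (only the two columns needed here).
df = pd.DataFrame(
    {
        "st": [922444, 922445, 928718, 928718],
        "se": [923190, 923190, 929456, 929459],
    },
    index=[42, 24, 43, 37],
)

# Bin st into buckets of width 100; rows sharing a bucket count as duplicates.
deduped = (
    df.assign(rng=df["st"].floordiv(100))
      .drop_duplicates(subset="rng")
      .drop(columns="rng")
)
print(deduped)
```

Fixed bins split neighbours that straddle a bin edge (922499 and 922500 land in different bins, for instance), so this only approximates "within 10 positions".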

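Or should the st value for key 24 be the same as the one above it (for key 42), and should se in the second pair of rows also be the same? In that case you could use:

df.drop_duplicates(subset=['st', 'se'])

without any auxiliary column.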
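To see the difference on the sample rows: exact matching on st and se keeps all four rows, because no two rows are identical in both columns. Binning both columns leaves one row per pair. Binning se as well, with the same width of 100, is an assumption that extends the rng idea; the names st_bin, se_bin, exact and binned are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "st": [922444, 922445, 928718, 928718],
        "se": [923190, 923190, 929456, 929459],
    },
    index=[42, 24, 43, 37],
)

# Exact matching: nothing is dropped, since every (st, se) pair is distinct.
exact = df.drop_duplicates(subset=["st", "se"])

# Binned matching on both columns (assumed bin width: 100).
binned = (
    df.assign(st_bin=df["st"] // 100, se_bin=df["se"] // 100)
      .drop_duplicates(subset=["st_bin", "se_bin"])
      .drop(columns=["st_bin", "se_bin"])
)
print(len(exact), len(binned))
```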

Answer 2

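I assume you want to merge all ranges, so that all overlapping ranges are reduced to one row. I think you need to do this repeatedly, because several ranges, not just two, can chain together into one big range. You could do it like this (just replace df with the variable you use to store your dataframe):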

import pandas as pd

# create a dummy key column to produce a cartesian product
df['fake_key']=0
right_df= pd.DataFrame(df, copy=True)
right_df.rename({col: col + '_r' for col in right_df if col!='fake_key'}, axis='columns', inplace=True)

# this variable indicates that we need to perform the loop once more
change=True
# diff and new_diff are used to see, if the loop iteration changed something
# it's monotonically increasing btw.
new_diff= (right_df['se_r'] - right_df['st_r']).sum()
while change:
    diff= new_diff
    joined_df= df.merge(right_df, on='fake_key')
    invalid_indexer= joined_df['se']<joined_df['st_r']    
    joined_df.drop(joined_df[invalid_indexer].index, axis='index', inplace=True)
    right_df= joined_df.groupby('st').aggregate({col: 'max' if '_min' not in col else 'min' for col in joined_df})
    # update the ..._min / ..._max fields in the combined range
    for col in ['st_min', 'se_min', 'st_max', 'se_max']:
        col_r= col + '_r'
        col1, col2= (col, col_r) if 'min' in col else (col_r, col)
        right_df[col_r]= right_df[col1].where(right_df[col1]<=right_df[col2], right_df[col2])
    right_df.drop(['se', 'st_r', 'st_min', 'se_min', 'st_max', 'se_max'], axis='columns', inplace=True)
    right_df.rename({'st': 'st_r'}, axis='columns', inplace=True)
    right_df['fake_key']=0
    # now check if we need to iterate once more
    new_diff= (right_df['se_r'] - right_df['st_r']).sum()
    change= diff <= new_diff

# now all ranges which overlap have the same value for se_r
# so we just need to aggregate on se_r to remove them
result= right_df.groupby('se_r').aggregate({col: 'min' if '_max' not in col else 'max' for col in right_df})
result.rename({col: col[:-2] if col.endswith('_r') else col for col in result}, axis='columns', inplace=True)
result.drop('fake_key', axis='columns', inplace=True)

If you run this on the data, you get:

            st      se  st_min  st_max  se_min  se_max
se_r                                                  
923190  922444  923190  922434  922455  923180  923200
929459  928718  929459  922434  928728  923180  929469

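Note that if your dataset is larger than a few thousand records, you may need to change the join logic above, which produces a cartesian product. In the first iteration you get a joined_df of size n^2, where n is the number of records in the input dataframe. In each later iteration, joined_df gets smaller because of the aggregation.

I just ignored that because I don't know how large your dataset is. Avoiding it would make the code more complex. But if you need it, you can create an auxiliary dataframe that lets you "bin" the se values on both dataframes and use the binned value as fake_key. It is not quite regular binning: you have to create a dataframe that contains, for each fake_key, all values in the range (0...fake_key). So, for example, if you define your fake key as fake_key=se//1000, your dataframe would contain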

fake_key  fake_key_join
922       922
922       921
922       920
...       ...
922       0
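One way such a helper dataframe could be built, as a sketch only: for every distinct fake_key k it lists the pairs (k, k), (k, k-1), ..., (k, 0). The se values are the sample ones from the question, and the variable names keys and helper are illustrative.

```python
import numpy as np
import pandas as pd

# se values of the sample rows from the question.
se = pd.Series([923190, 923190, 929456, 929459])

# Distinct bin keys, using the fake_key = se // 1000 definition from the text.
keys = (se // 1000).unique()

# For each key k, emit k + 1 rows pairing k with k, k-1, ..., 0.
helper = pd.DataFrame(
    {
        "fake_key": np.repeat(keys, keys + 1),
        "fake_key_join": np.concatenate([np.arange(k, -1, -1) for k in keys]),
    }
)
print(helper.head())
```

The helper grows with the size of the key values (k + 1 rows per distinct key), so a coarser bin width keeps it smaller at the cost of larger joins per bin.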

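If you replace the merge in the loop above with code that merges such a dataframe with right_df on fake_key, and then merges the result with df on fake_key_join, you can use the rest of the code and get the same result as above, but without having to produce the full cartesian product.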
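In the output shown earlier, the st_min and se_min values of the second row (922434 and 923180) belong to the first range. This appears to come from the join filter, which only checks se against st_r and so also pairs rows whose ranges do not overlap. The following is a sketch of a different technique, not the method above: a sort-and-sweep interval merge, assuming two rows belong together when their [st, se] ranges overlap. It needs no cartesian product. The names ordered, prev_end, group_id and merged are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "st": [922444, 922445, 928718, 928718],
        "se": [923190, 923190, 929456, 929459],
        "st_min": [922434, 922435, 928708, 928708],
        "st_max": [922454, 922455, 928728, 928728],
        "se_min": [923180, 923180, 929446, 929449],
        "se_max": [923200, 923200, 929466, 929469],
    },
    index=[42, 24, 43, 37],
)

# Sort by range start, then track the furthest range end seen so far.
ordered = df.sort_values(["st", "se"])
prev_end = ordered["se"].cummax().shift()

# A row opens a new group when it starts after every earlier range has ended.
# (The first row compares against NaN, which is False, so it joins group 0.)
group_id = (ordered["st"] > prev_end).cumsum().rename("group")

# Collapse each group to one row: the widest extent of every column pair.
merged = ordered.groupby(group_id).agg(
    {
        "st": "min",
        "se": "max",
        "st_min": "min",
        "st_max": "max",
        "se_min": "min",
        "se_max": "max",
    }
)
print(merged)
```

On the sample data this should give two rows, with the second keeping its own st_min of 928708 and se_min of 929446. The cost is dominated by the sort, so it should scale to larger inputs better than the n^2 join.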

python

redundancy

pandas