Intersection of 2 pandas dataframes
In my problem I have 2 dataframes, mydataframe1 and mydataframe2, as shown below.
mydataframe1
Out[13]:
Start End Remove
50 60 1
61 105 0
106 150 1
151 160 0
161 180 1
181 200 0
201 400 1
mydataframe2
Out[14]:
Start End
55 100
105 140
151 154
155 185
220 240
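From mydataframe2 I want to remove the rows whose Start-End interval is contained (even partially) in any of the "Remove" = 1 intervals of mydataframe1. In other words, there should be no intersection between the intervals of mydataframe2 and the "Remove" = 1 intervals of mydataframe1.
In this case mydataframe2 becomes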
mydataframe2
Out[15]:
Start End
151 154
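For anyone who wants to reproduce this, here is a sketch that rebuilds the two frames and spells out the expected result with a plain Python check. It assumes the intervals are closed (both endpoints included); intervals_overlap, flagged, keep and expected are illustrative names, not part of the original data.

```python
import pandas as pd

# Sample data copied from the question.
mydataframe1 = pd.DataFrame({
    "Start":  [50, 61, 106, 151, 161, 181, 201],
    "End":    [60, 105, 150, 160, 180, 200, 400],
    "Remove": [1, 0, 1, 0, 1, 0, 1],
})
mydataframe2 = pd.DataFrame({
    "Start": [55, 105, 151, 155, 220],
    "End":   [100, 140, 154, 185, 240],
})


def intervals_overlap(a_start, a_end, b_start, b_end):
    """Hypothetical helper: True if two closed intervals share at least one point."""
    return a_start <= b_end and b_start <= a_end


# Intervals flagged for removal, as plain (start, end) tuples.
flagged = list(
    mydataframe1.loc[mydataframe1["Remove"] == 1, ["Start", "End"]]
    .itertuples(index=False, name=None)
)

# Keep a row of mydataframe2 only if it overlaps none of the flagged intervals.
keep = [
    not any(intervals_overlap(s, e, fs, fe) for fs, fe in flagged)
    for s, e in zip(mydataframe2["Start"], mydataframe2["End"])
]
expected = mydataframe2[keep]
print(expected)
```

Only the row 151-154 survives, which matches the output I am after.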
I think this should work:
mydataframe2[mydataframe2.Start.isin(mydataframe1[mydataframe1.Remove != 0].Start)]
Breakdown:
# Boolean mask: True for the mydataframe1 rows whose Remove flag is not 0
filter_non_remove = mydataframe1.Remove != 0
# Start values of the rows selected by that mask
valid_starts = mydataframe1[filter_non_remove].Start
# Another mask: True where a mydataframe2 Start
# value appears in valid_starts
is_df2_valid = mydataframe2.Start.isin(valid_starts)
# Apply the final filter
output = mydataframe2[is_df2_valid]
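One caveat with the filter above: isin compares Start values for exact equality, so it does not test whether intervals overlap, and on the sample data no Start values match, which gives an empty result. Below is a minimal sketch of an overlap-based filter using NumPy broadcasting instead. It assumes closed intervals (both endpoints included); flagged and overlap_matrix are illustrative names, not from the answer above.

```python
import pandas as pd

mydataframe1 = pd.DataFrame({
    "Start":  [50, 61, 106, 151, 161, 181, 201],
    "End":    [60, 105, 150, 160, 180, 200, 400],
    "Remove": [1, 0, 1, 0, 1, 0, 1],
})
mydataframe2 = pd.DataFrame({
    "Start": [55, 105, 151, 155, 220],
    "End":   [100, 140, 154, 185, 240],
})

flagged = mydataframe1[mydataframe1["Remove"] == 1]

# Column vectors for mydataframe2 and row vectors for the flagged intervals,
# so each comparison broadcasts to shape (len(mydataframe2), len(flagged)).
start2 = mydataframe2["Start"].to_numpy()[:, None]
end2 = mydataframe2["End"].to_numpy()[:, None]
start1 = flagged["Start"].to_numpy()[None, :]
end1 = flagged["End"].to_numpy()[None, :]

# Two closed intervals overlap when each starts no later than the other ends.
overlap_matrix = (start2 <= end1) & (start1 <= end2)

# Keep the rows that overlap none of the flagged intervals.
output = mydataframe2[~overlap_matrix.any(axis=1)]
print(output)
```

The intermediate matrix has one cell per pair of intervals, so this is probably only reasonable when the product of the two row counts fits comfortably in memory.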
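You can collect all the unique values covered by the ranges flagged Remove, then check that the Start and End values in mydataframe2 do not fall within any of them. The first part builds the set of all values that lie within a Start/End range where Remove = 1.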
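import numpy as np

# Start/End pairs of the rows flagged Remove == 1
# (to_numpy() replaces as_matrix(), which was removed in pandas 1.0)
start_end_remove = mydataframe1[mydataframe1['Remove'] == 1][['Start', 'End']].to_numpy()
remove_ranges = set()
for x in start_end_remove:
    # add every integer in the closed range [Start, End]
    remove_ranges.update(np.arange(x[0], x[1] + 1))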
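Next, you can evaluate mydataframe2 against this set of range values. If the Start/End values of a mydataframe2 row fall within it, flag the row in a new column and then drop it from the dataframe. A function is defined to test whether a row's range overlaps any of the removal ranges; it is applied to every row of mydataframe2, and the rows with an overlapping range are dropped.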
def evaluate_in_range(x, remove_ranges):
    # x is one row of mydataframe2; build every integer in its closed range
    s = x['Start']
    e = x['End']
    eval_range = set(np.arange(s, e + 1))
    # 1 if the row shares any value with the removal ranges, else 0
    if len(eval_range.intersection(remove_ranges)) > 0:
        return 1
    else:
        return 0

mydataframe2['Remove'] = mydataframe2[['Start', 'End']].apply(lambda x: evaluate_in_range(x, remove_ranges), axis=1)
mydataframe2.drop(mydataframe2[mydataframe2['Remove'] == 1].index, inplace=True)
How about this:
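# A constant key pairs every row of mydataframe2 (suffix _x)
# with every row of mydataframe1 (suffix _y) in the merge
mydataframe1['key'] = 1
mydataframe2['key'] = 1
df3 = mydataframe2.merge(mydataframe1, on="key")
# Does the mydataframe1 Start fall strictly inside the mydataframe2 interval?
df3['s_gt_s'] = df3.Start_y > df3.Start_x
df3['s_lt_e'] = df3.Start_y < df3.End_x
# Does the mydataframe1 End fall strictly inside the mydataframe2 interval?
df3['e_gt_s'] = df3.End_y > df3.Start_x
df3['e_lt_e'] = df3.End_y < df3.End_x
df3['s_in'] = df3.s_gt_s & df3.s_lt_e
df3['e_in'] = df3.e_gt_s & df3.e_lt_e
df3['overlaps'] = df3.s_in | df3.e_in
# Parenthesize the comparison: & binds more tightly than ==
my_new_dataframe = df3[df3.overlaps & (df3.Remove == 1)][['End_x','Start_x']].drop_duplicates()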
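Two things to watch with the snippet above: its comparisons only check whether a mydataframe1 endpoint falls strictly inside a mydataframe2 interval, so a mydataframe2 interval lying entirely inside a flagged one (220-240 inside 201-400) is not detected, and the final selection returns the overlapping intervals rather than the surviving ones. Here is a hedged sketch of the same cross-join idea with the usual closed-interval overlap test. It assumes pandas 1.2 or later for how="cross"; pairs, overlaps and bad_rows are illustrative names.

```python
import pandas as pd

mydataframe1 = pd.DataFrame({
    "Start":  [50, 61, 106, 151, 161, 181, 201],
    "End":    [60, 105, 150, 160, 180, 200, 400],
    "Remove": [1, 0, 1, 0, 1, 0, 1],
})
mydataframe2 = pd.DataFrame({
    "Start": [55, 105, 151, 155, 220],
    "End":   [100, 140, 154, 185, 240],
})

flagged = mydataframe1.loc[mydataframe1["Remove"] == 1, ["Start", "End"]]

# Pair every row of mydataframe2 with every flagged interval.
# reset_index() keeps the original row label in a column named "index".
pairs = mydataframe2.reset_index().merge(flagged, how="cross", suffixes=("_2", "_1"))

# Closed intervals overlap when each one starts no later than the other ends.
overlaps = (pairs["Start_2"] <= pairs["End_1"]) & (pairs["Start_1"] <= pairs["End_2"])

# Drop every mydataframe2 row that overlaps at least one flagged interval.
bad_rows = pairs.loc[overlaps, "index"].unique()
my_new_dataframe = mydataframe2.drop(index=bad_rows)
print(my_new_dataframe)
```

Filtering mydataframe1 down to the flagged rows before the join keeps the intermediate frame smaller than joining first and filtering on Remove afterwards.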
We can use the "Medial- or length-oriented tree: Overlap test":
In [143]: d1 = d1.assign(s=d1.Start+d1.End, d=d1.End-d1.Start)
In [144]: d2 = d2.assign(s=d2.Start+d2.End, d=d2.End-d2.Start)
In [145]: d1
Out[145]:
Start End Remove d s
0 50 60 1 10 110
1 61 105 0 44 166
2 106 150 1 44 256
3 151 160 0 9 311
4 161 180 1 19 341
5 181 200 0 19 381
6 201 400 1 199 601
In [146]: d2
Out[146]:
Start End d s
0 55 100 45 155
1 105 140 35 245
2 151 154 3 305
3 155 185 30 340
4 220 240 20 460
Now we can check for overlapping intervals and filter:
In [148]: d2[~d2[['s','d']]\
...: .apply(lambda x: ((d1.loc[d1.Remove==1, 's'] - x.s).abs() <
...: d1.loc[d1.Remove==1, 'd'] +x.d).any(),
...: axis=1)]\
...: .drop(['s','d'], axis=1)
...:
Out[148]:
Start End
2 151 154
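To see why comparing the sums and lengths works: two closed intervals [a, b] and [c, d] overlap when the distance between their midpoints is at most the sum of their half-lengths, and multiplying by 2 gives |(a + b) - (c + d)| <= (b - a) + (d - c). The sketch below checks this by brute force on small integer intervals. The function names are illustrative; note that the strict < used in the filter above differs from the closed-interval test only when two intervals touch at a single endpoint, which does not happen in the sample data.

```python
from itertools import product


def overlap_by_endpoints(a, b, c, d):
    """Reference test for closed intervals [a, b] and [c, d]."""
    return a <= d and c <= b


def overlap_by_sum_and_length(a, b, c, d, strict=False):
    """Same test expressed with s = Start + End and d = End - Start."""
    s1, d1 = a + b, b - a
    s2, d2 = c + d, d - c
    if strict:
        return abs(s1 - s2) < d1 + d2
    return abs(s1 - s2) <= d1 + d2


# Every integer interval [lo, hi] with 0 <= lo <= hi <= 5.
intervals = [(lo, hi) for lo in range(6) for hi in range(lo, 6)]

# The non-strict form agrees with the endpoint test on every pair.
agree = all(
    overlap_by_endpoints(a, b, c, d) == overlap_by_sum_and_length(a, b, c, d)
    for (a, b), (c, d) in product(intervals, repeat=2)
)

# Pairs where the strict form disagrees with the endpoint test.
strict_mismatches = [
    ((a, b), (c, d))
    for (a, b), (c, d) in product(intervals, repeat=2)
    if overlap_by_endpoints(a, b, c, d)
    != overlap_by_sum_and_length(a, b, c, d, strict=True)
]
print(agree, len(strict_mismatches))
```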
You can use pd.IntervalIndex for the intersection.
Get the rows to be removed
In [313]: dfr = df1.query('Remove == 1')
Construct an IntervalIndex from the ranges to be removed
In [314]: s1 = pd.IntervalIndex.from_arrays(dfr.Start, dfr.End, 'both')
Construct an IntervalIndex from the ranges to be tested
In [315]: s2 = pd.IntervalIndex.from_arrays(df2.Start, df2.End, 'both')
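Select the rows of s2 that overlap none of the ranges in s1 (using IntervalIndex.overlaps, since on recent pandas versions `x in s1` only matches an identical interval)
In [316]: df2.loc[[not s1.overlaps(x).any() for x in s2]]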
Out[316]:
Start End
2 151 154
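A note on the 'both' argument passed to from_arrays: it makes the intervals closed on both sides, which matters when two ranges merely touch at an endpoint. The following small sketch uses made-up intervals (not taken from the question) to show the difference; the variable names are illustrative.

```python
import pandas as pd

# Two intervals that touch at 60, closed on both sides: both contain 60.
closed_a = pd.Interval(50, 60, closed="both")
closed_b = pd.Interval(60, 70, closed="both")

# The same endpoints with the default-style right-closed intervals:
# (60, 70] does not contain 60, so the two share no point.
right_a = pd.Interval(50, 60, closed="right")
right_b = pd.Interval(60, 70, closed="right")

touch_when_closed = closed_a.overlaps(closed_b)
touch_when_right = right_a.overlaps(right_b)

# IntervalIndex.overlaps applies the same rule element-wise.
idx = pd.IntervalIndex.from_arrays([50, 106], [60, 150], closed="both")
mask = idx.overlaps(pd.Interval(60, 105, closed="both"))

print(touch_when_closed, touch_when_right, mask)
```

With integer Start/End values that are meant to be inclusive, as in the question, closed on both sides seems to be the natural choice.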
Details
In [320]: df1
Out[320]:
Start End Remove
0 50 60 1
1 61 105 0
2 106 150 1
3 151 160 0
4 161 180 1
5 181 200 0
6 201 400 1
In [321]: df2
Out[321]:
Start End
0 55 100
1 105 140
2 151 154
3 155 185
4 220 240
In [322]: dfr
Out[322]:
Start End Remove
0 50 60 1
2 106 150 1
4 161 180 1
6 201 400 1
IntervalIndex details
In [323]: s1
Out[323]:
IntervalIndex([[50, 60], [106, 150], [161, 180], [201, 400]]
closed='both',
dtype='interval[int64]')
In [324]: s2
Out[324]:
IntervalIndex([[55, 100], [105, 140], [151, 154], [155, 185], [220, 240]]
closed='both',
dtype='interval[int64]')
In [326]: [not s1.overlaps(x).any() for x in s2]
Out[326]: [False, False, True, False, False]