Check if value in pandas dataframe is within any two values of two other columns in another dataframe
I have two dataframes of different lengths: dfSamples (63012375 rows) and dfFixations (200000 rows).
dfSamples = pd.DataFrame({'tSample':[4, 6, 8, 10, 12, 14]})
dfFixations = pd.DataFrame({'tStart':[4,12],'tEnd':[8,14]})
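I want to check whether each value in dfSamples falls within any of the ranges given in dfFixations, and then assign a label to that value. I found a related post, but the loop solution there is very slow and I could not get any of the other solutions to work.
Working (but very slow) example: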
labels = np.empty_like(dfSamples['tSample']).astype(np.chararray)
# Mark every sample that falls inside the current fixation (bounds inclusive)
for i, fixation in dfFixations.iterrows():
    log_range = dfSamples['tSample'].between(fixation['tStart'], fixation['tEnd'])
    labels[log_range] = 'fixation'
# Everything not marked above is outside all fixations
labels[labels != 'fixation'] = 'no_fixation'
dfSamples['labels'] = labels
Following another example, I tried to vectorize this, but without success:
def check_range(samples, tstart, tend):
    log_range = (samples > tstart) & (samples < tend)
    return log_range
fixations = list(map(check_range, dfSamples['tSample'], dfFixations['tStart'], dfFixations['tEnd']))
Any help is much appreciated!
Setup
dfSamples = pd.DataFrame({'tSample':[4, 6, 8, 10, 12, 14]})
dfFixations = pd.DataFrame({'tStart':[4,12],'tEnd':[8,14]})
Solution
Create an interval index from the start and end points:
ii = pd.IntervalIndex.from_arrays(dfFixations['tStart'], dfFixations['tEnd'], closed="both")
ii.contains is a method that checks, for each interval in the interval index, whether it contains a given point. For example,
dfSamples["tSample"].apply(ii.contains)
gives
0     [True, False]
1     [True, False]
2     [True, False]
3    [False, False]
4     [False, True]
5     [False, True]
Name: tSample, dtype: object
We take this result and apply the any function to each element (a list) to get a pandas.Series of booleans, which we can then use with numpy.where:
dfSamples["labels"] = np.where(dfSamples["tSample"].apply(ii.contains).apply(any), "fixation", "no_fixation")
Result
   tSample       labels
0        4     fixation
1        6     fixation
2        8     fixation
3       10  no_fixation
4       12     fixation
5       14     fixation
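Which samples count as inside depends on the closed argument. The boundary samples 4, 8, 12 and 14 sit exactly on an interval edge, so they flip between labels as closed changes. The snippet below is a small sketch to make that visible; the helper name contained is made up for illustration and is not part of the answer.

```python
import pandas as pd

starts, ends = [4, 12], [8, 14]
samples = [4, 6, 8, 10, 12, 14]

def contained(closed):
    """Return, per sample, whether any interval contains it (hypothetical helper)."""
    ii = pd.IntervalIndex.from_arrays(starts, ends, closed=closed)
    # ii.contains(s) yields one boolean per interval; any() collapses them
    return [bool(ii.contains(s).any()) for s in samples]

both = contained("both")        # [4, 8] and [12, 14]
left = contained("left")        # [4, 8) and [12, 14)
neither = contained("neither")  # (4, 8) and (12, 14)

for name, flags in [("both", both), ("left", left), ("neither", neither)]:
    print(name, flags)
```

Note that Series.between in the loop version is inclusive on both ends by default, which corresponds to closed="both", while the strict comparisons in check_range correspond to closed="neither".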
Edit: a faster version
Using piso v0.6.0
import numpy as np
import pandas as pd
import piso
ii = pd.IntervalIndex.from_arrays(dfFixations['tStart'], dfFixations['tEnd'], closed="both")
contained = np.logical_or.reduce(piso.contains(ii, dfSamples["tSample"], include_index=False), axis=0)
dfSamples["labels"] = np.where(contained, "fixation", "no_fixation")
This runs in a time similar to @jezrael's solution, but it can also handle the case where the intervals overlap, e.g.
dfFixations = pd.DataFrame({'tStart':[4,5,12],'tEnd':[8,9,14]})
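Overlapping intervals can also be handled with NumPy alone, which may help at the row counts mentioned in the question. What follows is a sketch rather than part of either answer: label_samples is a hypothetical helper, it assumes intervals closed on both ends, and it assumes there is at least one fixation. It sorts the intervals by start, keeps a running maximum of the end values, and uses a binary search to find the last interval starting at or before each sample.

```python
import numpy as np
import pandas as pd

def label_samples(samples, starts, ends):
    """Label samples lying in any [start, end] interval (hypothetical helper)."""
    samples = np.asarray(samples)
    starts = np.asarray(starts)
    ends = np.asarray(ends)

    # Sort intervals by start; max_end[i] is the largest end among the
    # first i + 1 intervals, which is what makes overlaps safe.
    order = np.argsort(starts, kind="stable")
    starts_sorted = starts[order]
    max_end = np.maximum.accumulate(ends[order])

    # Index of the last interval whose start is <= sample (-1 if none)
    idx = np.searchsorted(starts_sorted, samples, side="right") - 1

    # A sample is covered if some interval starting at or before it
    # ends at or after it.
    contained = (idx >= 0) & (samples <= max_end[np.maximum(idx, 0)])
    return np.where(contained, "fixation", "no_fixation")

dfSamples = pd.DataFrame({'tSample': [4, 6, 8, 10, 12, 14]})
dfFixations = pd.DataFrame({'tStart': [4, 5, 12], 'tEnd': [8, 9, 14]})
dfSamples['labels'] = label_samples(dfSamples['tSample'],
                                    dfFixations['tStart'],
                                    dfFixations['tEnd'])
print(dfSamples)
```

The cost is one sort of the fixations plus one binary search per sample, and no samples-by-fixations matrix is ever built.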
Use IntervalIndex.from_arrays with IntervalIndex.get_indexer. If there is no match, -1 is returned, so test for that and set the output with numpy.where:
i = pd.IntervalIndex.from_arrays(dfFixations['tStart'],
                                 dfFixations['tEnd'],
                                 closed="both")
pos = i.get_indexer(dfSamples['tSample'])
dfSamples['labels'] = np.where(pos != -1, "fixation", "no_fixation")
print(dfSamples)
   tSample       labels
0        4     fixation
1        6     fixation
2        8     fixation
3       10  no_fixation
4       12     fixation
5       14     fixation
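One caveat with get_indexer is that it expects the intervals not to overlap. The sketch below wraps the approach in a hypothetical helper, label_with_get_indexer (not part of the answer), which checks the real IntervalIndex.is_overlapping property first and fails with a clear message instead of relying on whatever pandas raises internally.

```python
import numpy as np
import pandas as pd

def label_with_get_indexer(samples, starts, ends):
    """get_indexer-based labelling with an explicit overlap check (hypothetical helper)."""
    ii = pd.IntervalIndex.from_arrays(starts, ends, closed="both")
    if ii.is_overlapping:
        # get_indexer needs a unique match per sample, which overlaps break
        raise ValueError("intervals overlap; use a method that supports overlaps")
    pos = ii.get_indexer(np.asarray(samples))
    # -1 means the sample matched no interval
    return np.where(pos != -1, "fixation", "no_fixation")

labels = label_with_get_indexer([4, 6, 8, 10, 12, 14], [4, 12], [8, 14])
print(labels)

# The overlapping example from the other answer is flagged as such
overlapping = pd.IntervalIndex.from_arrays([4, 5, 12], [8, 9, 14], closed="both").is_overlapping
print(overlapping)
```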
Performance: measured on ideal data (nicely sorted, non-overlapping intervals). Real data will probably perform differently, so it is best to test.
dfSamples = pd.DataFrame({'tSample':range(10000)})
dfFixations = pd.DataFrame({'tStart':range(0, 10000, 5),'tEnd':range(2, 10000, 5)})
In [165]: %%timeit
...: labels = np.empty_like(dfSamples['tSample']).astype(np.chararray)
...: for i, fixation in dfFixations.iterrows():
...:     log_range = dfSamples['tSample'].between(fixation['tStart'], fixation['tEnd'])
...:     labels[log_range] = 'fixation'
...: labels[labels != 'fixation'] = 'no_fixation'
...: dfSamples['labels'] = labels
...:
...:
1.25 s ± 52.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [168]: %%timeit
...: ii = pd.IntervalIndex.from_arrays(dfFixations['tStart'], dfFixations['tEnd'], closed="both")
...: dfSamples["labels1"] = np.where(dfSamples["tSample"].apply(ii.contains).apply(any), "fixation", "no_fixation")
...:
315 ms ± 18.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [170]: %%timeit
...: ii = pd.IntervalIndex.from_arrays(dfFixations['tStart'], dfFixations['tEnd'], closed="both")
...: contained = np.logical_or.reduce(piso.contains(ii, dfSamples["tSample"], include_index=False), axis=0)
...: dfSamples["labels1"] = np.where(contained, "fixation", "no_fixation")
...:
82.4 ms ± 213 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [166]: %%timeit
...: s = pd.IntervalIndex.from_arrays(dfFixations['tStart'], dfFixations['tEnd'], closed="both")
...: pos = s.get_indexer(dfSamples['tSample'])
...: dfSamples['labels'] = np.where(pos != -1, "fixation", "no_fixation")
...:
27.8 ms ± 1.51 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)