Is .isin() faster than .query()?
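Question:
Hi,
While looking into ways to select rows from a DataFrame (I am relatively inexperienced with pandas), I ran into the following question:
Which is faster for large datasets: .isin() or .query()?
query is more intuitive to read, so it is my preferred approach for my line of work. However, when I test it on a very small example dataset, query seems to be much slower.
Has anyone tested this properly before? If so, what were the outcomes? I searched the web but could not find another post on this.
See the sample code below, which works for Python 3.8.5.
Many thanks in advance for your help!
Code: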
# Packages
import pandas as pd
import timeit
import numpy as np
# Create dataframe
df = pd.DataFrame({'name': ['Foo', 'Bar', 'Faz'],
                   'owner': ['Canyon', 'Endurace', 'Bike']},
                  index=['Frame', 'Type', 'Kind'])
# Show dataframe
df
# Create filter
selection = ['Canyon']
# Filter dataframe using 'isin' (type 1)
df_filtered = df[df['owner'].isin(selection)]
%timeit df_filtered = df[df['owner'].isin(selection)]
213 µs ± 14 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# Filter dataframe using 'isin' (type 2)
df[np.isin(df['owner'].values, selection)]
%timeit df_filtered = df[np.isin(df['owner'].values, selection)]
128 µs ± 3.11 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# Filter dataframe using 'query'
df_filtered = df.query("owner in @selection")
%timeit df_filtered = df.query("owner in @selection")
1.15 ms ± 9.35 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
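Before comparing timings, it can help to confirm that the three variants select the same rows. The following is a minimal sketch using the sample frame from the question; the names by_isin, by_np_isin and by_query are only illustrative, and .to_numpy() is used here in place of .values.

```python
import numpy as np
import pandas as pd

# Same sample frame as in the question
df = pd.DataFrame({'name': ['Foo', 'Bar', 'Faz'],
                   'owner': ['Canyon', 'Endurace', 'Bike']},
                  index=['Frame', 'Type', 'Kind'])
selection = ['Canyon']

# Three ways to keep the rows whose owner is in the selection
by_isin = df[df['owner'].isin(selection)]
by_np_isin = df[np.isin(df['owner'].to_numpy(), selection)]
by_query = df.query("owner in @selection")

# All three should hold exactly the same rows, index and values
print(by_isin.equals(by_np_isin) and by_isin.equals(by_query))
```

If the three results agree, any timing difference comes from the method itself rather than from a difference in what is being selected.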
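It is best to test on real data. Here is a quick comparison for 3k, 300k and 3M rows built from this sample data (the row counts imply that each pd.concat below starts from the original 3-row frame):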
selection = ['Hedge']
df = pd.concat([df] * 1000, ignore_index=True)
In [139]: %timeit df[df['owner'].isin(selection)]
449 µs ± 58 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [140]: %timeit df.query("owner in @selection")
1.57 ms ± 33.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
df = pd.concat([df] * 100000, ignore_index=True)
In [142]: %timeit df[df['owner'].isin(selection)]
8.25 ms ± 66.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [143]: %timeit df.query("owner in @selection")
13 ms ± 1.05 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
df = pd.concat([df] * 1000000, ignore_index=True)
In [145]: %timeit df[df['owner'].isin(selection)]
94.5 ms ± 9.28 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [146]: %timeit df.query("owner in @selection")
112 ms ± 499 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
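If you check the docs:
DataFrame.query() using numexpr is slightly faster than Python for large frames
Conclusion: it is best to test on real data, because the result depends on the number of rows, the number of matched values, and the length of the selection list.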
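One way to run such a test on your own data is a small timing helper built on the standard timeit module. This is only a sketch: filter_isin, filter_query and best_time are hypothetical helper names, the column name owner is taken from the example above, and the 3k-row frame is a placeholder for a real dataset.

```python
import timeit

import pandas as pd


def filter_isin(df, selection):
    return df[df['owner'].isin(selection)]


def filter_query(df, selection):
    # selection is a local name in this function, so @selection can resolve it
    return df.query("owner in @selection")


def best_time(func, df, selection, repeat=5, number=20):
    """Best average seconds per call over `repeat` batches of `number` calls."""
    timer = timeit.Timer(lambda: func(df, selection))
    return min(timer.repeat(repeat=repeat, number=number)) / number


# Placeholder data: swap in your real frame and selection list here.
sample = pd.DataFrame({'name': ['Foo', 'Bar', 'Faz'],
                       'owner': ['Canyon', 'Endurace', 'Bike']})
big = pd.concat([sample] * 1000, ignore_index=True)  # 3k rows
selection = ['Canyon']

timings = {func.__name__: best_time(func, big, selection)
           for func in (filter_isin, filter_query)}
for name, seconds in timings.items():
    print(f"{name}: {seconds * 1e6:.1f} µs per call")
```

Taking the minimum over several batches reduces noise from other processes. The absolute numbers will differ between machines and pandas versions, so only the relative ordering on your own data is meaningful.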
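Some perfplot results on generated data:
This assumes some hypothetical data and a selection that grows proportionally with the frame (10% of the frame size).
Sample data for n=10:
df: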
       name  owner
0  Constant  JoVMq
1  Constant  jiKNB
2  Constant  WEqhm
3  Constant  pXNqB
4  Constant  SnlbV
5  Constant  Euwsj
6  Constant  QPPbs
7  Constant  Nqofa
8  Constant  qeUKP
9  Constant  ZBFce
selection:
['ZBFce']
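The performance reflects the docs. On smaller frames, the overhead of query is significantly higher than that of isin. However, at around 200k rows the performance is comparable to isin, and at around 10m rows query starts to become faster.
I agree that, as with most pandas runtime questions, this is very data dependent. The best test is to benchmark on real datasets for the given use case and decide based on that.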
Edit: included the suggestion to convert selection to a set and use apply + in:
Perfplot code:
import string
import numpy as np
import pandas as pd
import perfplot

charset = list(string.ascii_letters)
np.random.seed(5)

def gen_data(n):
    df = pd.DataFrame({'name': 'Constant',
                       'owner': [''.join(np.random.choice(charset, 5))
                                 for _ in range(n)]})
    selection = df['owner'].sample(frac=.1).tolist()
    return df, selection, set(selection)

def test_isin(params):
    df, selection, _ = params
    return df[df['owner'].isin(selection)]

def test_query(params):
    df, selection, _ = params
    return df.query("owner in @selection")

def test_apply_over_set(params):
    df, _, set_selection = params
    return df[df['owner'].apply(lambda x: x in set_selection)]

if __name__ == '__main__':
    out = perfplot.bench(
        setup=gen_data,
        kernels=[
            test_isin,
            test_query,
            test_apply_over_set
        ],
        labels=[
            'test_isin',
            'test_query',
            'test_apply_over_set'
        ],
        n_range=[2 ** k for k in range(25)],
        equality_check=None
    )
    out.save('perfplot_results.png', transparent=False)
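Because the benchmark above passes equality_check=None, perfplot does not compare the outputs of the three kernels. A quick manual check on one generated frame could look like the sketch below. It does not need perfplot. As assumptions that differ from the code above, it uses np.random.default_rng and a fixed random_state for reproducibility, and n=1000 is an arbitrary size.

```python
import string

import numpy as np
import pandas as pd

charset = list(string.ascii_letters)
rng = np.random.default_rng(5)  # assumption: seeded generator instead of np.random.seed


def gen_data(n):
    # n random 5-letter owner strings, with 10% of them sampled as the selection
    owners = [''.join(rng.choice(charset, 5)) for _ in range(n)]
    df = pd.DataFrame({'name': 'Constant', 'owner': owners})
    selection = df['owner'].sample(frac=.1, random_state=5).tolist()
    return df, selection, set(selection)


df, selection, set_selection = gen_data(1000)

by_isin = df[df['owner'].isin(selection)]
by_query = df.query("owner in @selection")
by_apply = df[df['owner'].apply(lambda x: x in set_selection)]

# Raises AssertionError if any kernel selects different rows
pd.testing.assert_frame_equal(by_isin, by_query)
pd.testing.assert_frame_equal(by_isin, by_apply)
print(f"all three kernels agree on {len(by_isin)} rows")
```

The row count can exceed 10% of the frame, since a randomly generated owner string may occur more than once and every occurrence is selected.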