Is .isin() faster than .query()?

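Question:

Hi,

While searching for ways to select rows from a dataframe (I'm relatively inexperienced with Pandas), I ran into the following question:

Which is faster for large datasets: .isin() or .query()?

query is more intuitive to read, so it is my preferred approach for my line of work. However, when I tested it on a very small example dataset, query seemed to be much slower.

Has anyone tested this properly before? If so, what were the outcomes? I searched the web but could not find another post on this.

See the sample code below, which works for Python 3.8.5.

Many thanks in advance for your help!

Code: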
# Packages
import pandas as pd
import timeit
import numpy as np


# Create dataframe
df = pd.DataFrame({'name': ['Foo', 'Bar', 'Faz'],
                   'owner': ['Canyon', 'Endurace', 'Bike']},
                  index=['Frame', 'Type', 'Kind'])

# Show dataframe
df

# Create filter
selection = ['Canyon']

# Filter dataframe using 'isin' (type 1)
df_filtered = df[df['owner'].isin(selection)] 

%timeit df_filtered = df[df['owner'].isin(selection)]
213 µs ± 14 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


# Filter dataframe using 'isin' (type 2)
df[np.isin(df['owner'].values, selection)]

%timeit df_filtered = df[np.isin(df['owner'].values, selection)]
128 µs ± 3.11 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)


# Filter dataframe using 'query'
df_filtered = df.query("owner in @selection")

%timeit df_filtered = df.query("owner in @selection")
1.15 ms ± 9.35 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
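
Note that %timeit is an IPython/Jupyter magic and will not run in a plain Python script. A rough equivalent using the standard-library timeit module might look like the sketch below. The function names (by_isin, by_np_isin, by_query, time_filters) and the number/repeat values are made up for illustration and are not part of the original post; absolute timings will differ from machine to machine.

```python
import timeit

import numpy as np
import pandas as pd


def by_isin(df, selection):
    return df[df['owner'].isin(selection)]


def by_np_isin(df, selection):
    return df[np.isin(df['owner'].to_numpy(), selection)]


def by_query(df, selection):
    # @selection is resolved from this function's local scope
    return df.query("owner in @selection")


def time_filters(df, selection, number=100, repeat=3):
    """Return the best per-call time in seconds for each filter (hypothetical helper)."""
    timings = {}
    for func in (by_isin, by_np_isin, by_query):
        runs = timeit.repeat(lambda: func(df, selection),
                             number=number, repeat=repeat)
        timings[func.__name__] = min(runs) / number
    return timings


# Same small frame as in the question
df = pd.DataFrame({'name': ['Foo', 'Bar', 'Faz'],
                   'owner': ['Canyon', 'Endurace', 'Bike']},
                  index=['Frame', 'Type', 'Kind'])
selection = ['Canyon']

timings = time_filters(df, selection)
for name, seconds in timings.items():
    print(f"{name}: {seconds * 1e6:.1f} µs per call")
```

Taking the minimum over the repeats is the usual recommendation for timeit, since the slower runs mostly reflect noise from other processes.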

It is best to test on real data; here is a quick comparison for 3k, 300k and 3M rows using this sample data:

selection = ['Hedge']

# 3 rows * 1000 = 3k rows (df is the original 3-row frame)
df = pd.concat([df] * 1000, ignore_index=True)
In [139]: %timeit df[df['owner'].isin(selection)]
449 µs ± 58 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [140]: %timeit df.query("owner in @selection")
1.57 ms ± 33.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

# 3 rows * 100000 = 300k rows (starting again from the original 3-row frame)
df = pd.concat([df] * 100000, ignore_index=True)
In [142]: %timeit df[df['owner'].isin(selection)]
8.25 ms ± 66.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [143]: %timeit df.query("owner in @selection")
13 ms ± 1.05 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

# 3 rows * 1000000 = 3M rows (starting again from the original 3-row frame)
df = pd.concat([df] * 1000000, ignore_index=True)
In [145]: %timeit df[df['owner'].isin(selection)]
94.5 ms ± 9.28 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [146]: %timeit df.query("owner in @selection")
112 ms ± 499 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
    

If you check the docs:

DataFrame.query() using numexpr is slightly faster than Python for large frames

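Conclusion: it is best to test on real data, because the result depends on the number of rows, the number of matched values, and the length of the selection list.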
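
To see how those factors interact, something like the following sketch could be adapted. It uses synthetic data rather than the frames timed above, and make_frame, best_time, filter_isin and filter_query are hypothetical helpers introduced only for this illustration. For a real decision, swap in your own dataframe and selection list.

```python
import timeit

import numpy as np
import pandas as pd


def make_frame(n_rows, n_unique, seed=0):
    """Hypothetical generator: owners are the strings '0' .. str(n_unique - 1)."""
    rng = np.random.default_rng(seed)
    owners = rng.integers(0, n_unique, size=n_rows).astype(str)
    return pd.DataFrame({'name': 'Constant', 'owner': owners})


def best_time(func, number=5, repeat=3):
    """Best per-call time in seconds over several repeats."""
    return min(timeit.repeat(func, number=number, repeat=repeat)) / number


def filter_isin(df, selection):
    return df[df['owner'].isin(selection)]


def filter_query(df, selection):
    return df.query("owner in @selection")


results = []
for n_rows in (1_000, 100_000):            # vary the number of rows
    df = make_frame(n_rows, n_unique=1_000)
    for n_selected in (1, 100):            # vary the length of the selection list
        selection = [str(i) for i in range(n_selected)]
        results.append({
            'rows': n_rows,
            'selection_len': n_selected,
            'matched': len(filter_isin(df, selection)),
            'isin_s': best_time(lambda: filter_isin(df, selection)),
            'query_s': best_time(lambda: filter_query(df, selection)),
        })

print(pd.DataFrame(results))
```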

A perfplot on some generated data:

Assuming some hypothetical data, with a selection whose size grows proportionally (10% of the frame size).

Sample data for n=10:

df:

       name  owner
0  Constant  JoVMq
1  Constant  jiKNB
2  Constant  WEqhm
3  Constant  pXNqB
4  Constant  SnlbV
5  Constant  Euwsj
6  Constant  QPPbs
7  Constant  Nqofa
8  Constant  qeUKP
9  Constant  ZBFce

Selection:

['ZBFce']

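Performance reflects the docs. On smaller frames, the overhead of query is noticeably higher than that of isin. However, at around 200k rows, performance is comparable to isin, and at around 10m rows, query starts to become more performant.

I agree that, as with most pandas runtime questions, this is very data dependent, and the best test is to run on the real dataset for a given use case and make a decision based on that.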


Edit: Included the suggestion to convert selection to a set and use apply + in.
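
Since the timings are only comparable if all approaches select the same rows, a quick sanity check along these lines can be run first. This is a minimal sketch on a tiny hand-made frame; the owner values are borrowed from the sample data above, and the two-element selection is an assumption chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame({'name': 'Constant',
                   'owner': ['JoVMq', 'jiKNB', 'WEqhm', 'pXNqB', 'ZBFce']})
selection = ['ZBFce', 'WEqhm']
set_selection = set(selection)  # set membership is O(1) on average

via_isin = df[df['owner'].isin(selection)]
via_query = df.query("owner in @selection")
via_apply = df[df['owner'].apply(lambda x: x in set_selection)]

# All three should keep the same rows, in the original row order
assert via_isin.equals(via_query)
assert via_isin.equals(via_apply)
print(via_isin)
```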


Perfplot code:

import string

import numpy as np
import pandas as pd
import perfplot

charset = list(string.ascii_letters)

np.random.seed(5)


def gen_data(n):
    df = pd.DataFrame({'name': 'Constant',
                       'owner': [''.join(np.random.choice(charset, 5))
                                 for _ in range(n)]})
    selection = df['owner'].sample(frac=.1).tolist()
    return df, selection, set(selection)


def test_isin(params):
    df, selection, _ = params
    return df[df['owner'].isin(selection)]


def test_query(params):
    df, selection, _ = params
    return df.query("owner in @selection")


def test_apply_over_set(params):
    df, _, set_selection = params
    return df[df['owner'].apply(lambda x: x in set_selection)]


if __name__ == '__main__':
    out = perfplot.bench(
        setup=gen_data,
        kernels=[
            test_isin,
            test_query,
            test_apply_over_set
        ],
        labels=[
            'test_isin',
            'test_query',
            'test_apply_over_set'
        ],
        n_range=[2 ** k for k in range(25)],
        equality_check=None
    )
    out.save('perfplot_results.png', transparent=False)