How to make "value is in dataframe column" quicker

Question

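My code takes a userID, a categoryID and a date as input values. I want to check whether the entries are valid, e.g. whether the userID actually exists in my data set. It works the way I have it, but I have to wait a couple of seconds (!) before the main program is executed.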

var_uid = int(input("Please enter a user ID: "))
var_catid = input("Please enter a category ID: ")
var_date = input("Please enter a date to restrict the considered data (YYYY-MM-DD): ")


if (~var_uid in df_data['UserID'].values) :
    print("There is no such user with this UserID. Please enter a different UserID.")
elif (~df_data['CategoryID'].str.contains(var_catid).any()) :
    print("There is no such category with this CategoryID. Please enter a different CategoryID")
else:
    ### I convert my date to datetime object to be able to do some operations with it. ###
    date = pd.to_datetime(var_date)

    s_all = df_data[df_data.columns[7]]
    s_all_datetime = pd.to_datetime(s_all)
    df_data['UTCtime'] = s_all_datetime

    min_date_str = "2012-04-03"
    min_date = pd.to_datetime(min_date_str)
    max_date_str = "2013-02-16"
    max_date = pd.to_datetime(max_date_str)


    if (date < min_date or date > max_date) :
        print("There is no such date. Please enter a different date from 2012-04-03 until 2013-02-16")
    else:
        pass  # some code

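I know Stack Overflow is not meant for doing this kind of work, and my code does in fact work. Nevertheless, could you at least give me some hints on what a faster implementation would look like? The DataFrame has 230k rows, and having my program run over all of it for every if clause is certainly not the best approach.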

I thought I could extract the unique values of, for example, my UserID column, save them in a list and check against that in my if clause. But

df_data['UserID'].unique.tolist()

doesn't work.

Thanks for your help.

/Edit: here are df_data.info() and df_data.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 227428 entries, 0 to 227427
Data columns (total 8 columns):
UserID            227428 non-null int64
VenueID           227428 non-null object
CategoryID        227428 non-null object
CategoryName      227428 non-null object
Latitude          227428 non-null float64
Longitude         227428 non-null float64
TimezoneOffset    227428 non-null int64
UTCtime           227428 non-null object
dtypes: float64(2), int64(2), object(4)
memory usage: 13.9+ MB
None

head:

   UserID                   VenueID                CategoryID         CategoryName   Latitude  Longitude  TimezoneOffset                         UTCtime
0     470  49bbd6c0f964a520f4531fe3  4bf58dd8d48988d127951735  Arts & Crafts Store  40.719810 -74.002581            -240  Tue Apr 03 18:00:09 +0000 2012
1     979  4a43c0aef964a520c6a61fe3  4bf58dd8d48988d1df941735               Bridge  40.606800 -74.044170            -240  Tue Apr 03 18:00:25 +0000 2012
2      69  4c5cc7b485a1e21e00d35711  4bf58dd8d48988d103941735       Home (private)  40.716162 -73.883070            -240  Tue Apr 03 18:02:24 +0000 2012
3     395  4bc7086715a7ef3bef9878da  4bf58dd8d48988d104941735       Medical Center  40.745164 -73.982519            -240  Tue Apr 03 18:02:41 +0000 2012
4      87  4cf2c5321d18a143951b5cec  4bf58dd8d48988d1cb941735           Food Truck  40.740104 -73.989658            -240  Tue Apr 03 18:03:00 +0000 2012

Answer 1

What do you mean by "But df_data['UserID'].unique.tolist() doesn't work."?

Do you mean the command fails? That is probably because unique is a method, so you have to call it:

df_data['UserID'].unique().tolist()

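Or do you mean it is still too slow? In that case you probably don't want a Python list, because a lookup still has to walk through the entries one by one. If you use a set instead, a lookup takes O(1) time on average, since sets are hash-based. So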

set(df_data['UserID'].tolist())

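now makes looking up a user much faster. If the category needs a more complex call (e.g. str.contains), however, you still have to go through a list. But if the cardinality of the categories is much smaller than the number of rows, you can apply unique to that column and work with the shorter list.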
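To make the two suggestions above concrete, here is a minimal sketch. It assumes a toy `df_data` built from the rows shown in the question, and the helper names `user_exists` and `category_matches` are made up for illustration. It also uses `in` / `not in` directly: in the question's `~var_uid in ...`, the `~` is applied to the integer first (bitwise NOT), so it does not negate the membership test.

```python
import pandas as pd

# Toy stand-in for df_data (values copied from the head() output in the question).
df_data = pd.DataFrame({
    "UserID": [470, 979, 69, 395, 87],
    "CategoryID": [
        "4bf58dd8d48988d127951735",
        "4bf58dd8d48988d1df941735",
        "4bf58dd8d48988d103941735",
        "4bf58dd8d48988d104941735",
        "4bf58dd8d48988d1cb941735",
    ],
})

# Build the lookup structures once, before asking for any input.
user_ids = set(df_data["UserID"].unique().tolist())
category_ids = df_data["CategoryID"].unique().tolist()


def user_exists(uid):
    """Hash-based membership test against the precomputed set."""
    return uid in user_ids


def category_matches(fragment):
    """Plain substring match (not a regex) over the unique categories only."""
    return any(fragment in cat for cat in category_ids)


var_uid = 470
if not user_exists(var_uid):
    print("There is no such user with this UserID.")
```

Note that `str.contains` treats its argument as a regular expression by default, while the sketch does a literal substring test, so the two are only equivalent for inputs without regex metacharacters.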

Answer 2

For this kind of containment check you should make the user (and the category) an index:

if (~var_uid in df_data['UserID'].values) :

elif (~df_data['CategoryID'].str.contains(var_catid).any()) :

Once these are in the index (note: this should be done once, outside this block, not on every check):

df = df_data.set_index(["UserID", "CategoryID"])

Then you can do the lookups in O(1):

user_id in df.index.levels[0]
category_id in df.index.levels[1]  # granted this doesn't do the str contains (but that'll always be inefficient)

You can also create these indexes manually. Again, you have to do this once, rather than on every lookup, to get the benefit:

pd.Index(df_data["UserID"])
# if lots of non-unique users this will be more space efficient
pd.Index(df_data["UserID"].unique())
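As a rough sketch of the manual variant, assuming a small stand-in for `df_data` and a hypothetical helper `is_valid`: the two `pd.Index` objects are built once, and every later check reuses them. The category check here is an exact match, not a substring search.

```python
import pandas as pd

# Small stand-in for df_data; UserID 470 appears twice on purpose.
df_data = pd.DataFrame({
    "UserID": [470, 979, 69, 470, 87],
    "CategoryID": [
        "4bf58dd8d48988d127951735",
        "4bf58dd8d48988d1df941735",
        "4bf58dd8d48988d103941735",
        "4bf58dd8d48988d127951735",
        "4bf58dd8d48988d1cb941735",
    ],
})

# Built once, outside the validation block.
user_index = pd.Index(df_data["UserID"].unique())
category_index = pd.Index(df_data["CategoryID"].unique())


def is_valid(uid, catid):
    """Return True if both the user and the exact category ID are known."""
    return uid in user_index and catid in category_index
```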

Answer 3

Consider creating a lookup index once; membership tests then no longer have to scan the whole column. Here is an example:

import pandas as pd
import numpy as np

n = int(1e6)
np.random.seed(0)
df = pd.DataFrame({
    'uid': np.arange(n), 
    'catid': np.repeat('foo bar baz', n),
})

The slower version:

>>> %timeit for i in range(n // 2, n // 2 + 1000): i in df.uid.values
1 loop, best of 3: 2.32 s per loop

But you can precompute the index:

>>> uids = pd.Index(df.uid.values)
>>> %timeit for i in range(n // 2, n//2 + 1000): i in uids
1000 loops, best of 3: 412 µs per loop

Wow, that's fast. Let's see how long it takes to create the index:

>>> %timeit uids = pd.Index(df.uid.values)
10000 loops, best of 3: 22.5 µs per loop

You could also use a set (although for integers such as UserID a pandas Index will be faster). For example, for CategoryID you can precompute:

>>> catids = set(s for catid in df.catid.values for s in catid.split())

and then check

>>> catid in catids

which will be much faster.
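The `%timeit` lines above only work in IPython. A plain-Python version of the same comparison could look like the sketch below, using the standard `timeit` module. The sizes are deliberately smaller than in the example above so that it runs quickly, and the helper names are made up. Actual timings depend on the machine, so none are quoted here.

```python
import timeit

import numpy as np
import pandas as pd

n = 100_000  # smaller than the 1e6 used above, to keep the run short
df = pd.DataFrame({"uid": np.arange(n)})
probes = range(n // 2, n // 2 + 100)

values = df["uid"].values  # plain NumPy array: every `in` compares all elements
uids = pd.Index(values)    # built once and reused for every lookup


def scan_lookups():
    return [i in values for i in probes]


def index_lookups():
    return [i in uids for i in probes]


t_scan = timeit.timeit(scan_lookups, number=3)
t_index = timeit.timeit(index_lookups, number=3)
print(f"array scan: {t_scan:.4f} s, pd.Index: {t_index:.4f} s")
```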

Answer 4

Thanks a lot to all contributors! For my UserID I used the solution of setting it as the index. For my CategoryID I created a set and stored it in my program.

Besides that, I found another, even worse bottleneck:

s_all = df_data[df_data.columns[7]]
s_all_datetime = pd.to_datetime(s_all)
df_data['UTCtime'] = s_all_datetime

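It converts my 'UTCtime' column to datetime objects... and does so for all 230k rows on every run ^^ I now do the conversion only once and store the new DataFrame. The program still has to load the .csv each time, but that is much faster.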
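A minimal sketch of this one-time conversion, assuming the timestamp layout seen in the head() output (`Tue Apr 03 18:00:09 +0000 2012`); the format string and the helper `date_in_range` are assumptions for illustration, not the code actually used. Passing an explicit `format` also spares pandas from guessing the layout of the strings.

```python
import pandas as pd

# Toy stand-in for the 'UTCtime' column; strings copied from the head() output.
df_data = pd.DataFrame({
    "UTCtime": [
        "Tue Apr 03 18:00:09 +0000 2012",
        "Tue Apr 03 18:03:00 +0000 2012",
    ]
})

# Parse once, with an explicit format assumed from the sample rows.
df_data["UTCtime"] = pd.to_datetime(
    df_data["UTCtime"], format="%a %b %d %H:%M:%S %z %Y", utc=True
)

# Derive the valid range from the data instead of hard-coding it.
min_date = df_data["UTCtime"].min().normalize()
max_date = df_data["UTCtime"].max().normalize()


def date_in_range(date_str):
    """Check a YYYY-MM-DD input against the range covered by the data."""
    date = pd.Timestamp(date_str, tz="UTC")
    return min_date <= date <= max_date
```

The converted frame can then be saved once (for example with `df_data.to_pickle(...)` or `to_csv`) and reloaded on later runs, so the parsing cost is not paid every time.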
