Python dictionary too slow for cross-comparison, improvements?
I'm currently running into performance problems with Python dictionaries. I have a few huge dictionaries (up to 30k entries each) and I want to cross-compare those entries: given one entry (its identifier is a key), how many of the other dictionaries also contain an entry with this key? Right now this takes up to 5 hours on my machine, but it needs to finish in roughly a few minutes for my tool to be useful. I already tried removing entries as they are matched to make the search more efficient.

all_cached_data is a list that contains lists of these dictionaries. sources is a list holding information about the lists in all_cached_data.
appearsin_list = []

# first, get all the cached data
sources = sp.get_sources()
all_cachedata = [0]*len(sources)
for source in sources:
    iscached = source[8]
    sourceid = int(source[0])
    if iscached == "True":
        cachedata, _ = get_local_storage_info(sourceid)
    else:
        cachedata = []
    all_cachedata[sourceid-1] = cachedata

# second, compare cache entries
# iterate over all cached sources
for source in sources:
    sourceid = int(source[0])
    datatype = source[3]
    iscached = source[8]
    if verbose:
        print("Started comparing entries from source " + str(sourceid) +
              " with " + str(len(all_cachedata[sourceid-1])) + " entries.")
    if iscached == "True":
        # iterate over all other cache entries
        for entry in all_cachedata[sourceid-1]:
            # print("Comparing source " + str(sourceid) + " with source " + str(cmpsourceid) + ".")
            appearsin = 0
            for cmpsource in sources:
                cmpsourceid = int(cmpsource[0])
                cmpiscached = cmpsource[8]
                # find entries for same potential threat
                if cmpiscached == "True" and len(all_cachedata[cmpsourceid-1]) > 0 and cmpsourceid != sourceid:
                    for cmpentry in all_cachedata[cmpsourceid-1]:
                        if datatype in cmpentry:
                            if entry[datatype] == cmpentry[datatype]:
                                appearsin += 1
                                all_cachedata[cmpsourceid-1].remove(cmpentry)
                                break
            appearsin_list.append(appearsin)
            if appearsin > 0:
                if verbose:
                    print(entry[datatype] + " appears also in " + str(appearsin) + " more source/s.")
                all_cachedata[sourceid-1].remove(entry)

avg = float(sum(appearsin_list)) / float(len(appearsin_list))
print("Average appearance: " + str(avg))
print("Median: " + str(numpy.median(numpy.array(appearsin_list))))
print("Minimum: " + str(min(appearsin_list)))
print("Maximum: " + str(max(appearsin_list)))
Any hints on how to speed this up would be greatly appreciated.
I think your algorithm leaves room for improvement; nested loops are not a good fit here. I also think Python may not be the best choice for this particular purpose: comparing and searching through large amounts of data is exactly what SQL is made for. You could use something like sqlite_object to convert your dataset into a SQLite database and query it; a rough sketch of that idea follows below.
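As an illustration only (this uses the standard-library sqlite3 module rather than sqlite_object's API, the table and column names are made up, and it assumes the compared value is entry[datatype], reusing sources and all_cachedata from your snippet), the counting could be pushed into SQLite like this:

import sqlite3

# Load every cached entry's compared value into one indexed table, then let
# the database count in how many *other* sources the same value appears.
conn = sqlite3.connect(":memory:")  # use a file path instead to persist the DB
conn.execute("CREATE TABLE entries (sourceid INTEGER, value TEXT)")
conn.execute("CREATE INDEX idx_value ON entries (value)")

rows = []
for source in sources:  # 'sources' and 'all_cachedata' as in your snippet
    sourceid = int(source[0])
    datatype = source[3]
    if source[8] == "True":
        for entry in all_cachedata[sourceid - 1]:
            if datatype in entry:
                rows.append((sourceid, entry[datatype]))
conn.executemany("INSERT INTO entries VALUES (?, ?)", rows)

# For every entry, count the distinct other sources holding the same value.
query = """
    SELECT (SELECT COUNT(DISTINCT e2.sourceid)
              FROM entries e2
             WHERE e2.value = e1.value
               AND e2.sourceid != e1.sourceid)
      FROM entries e1
"""
appearsin_list = [row[0] for row in conn.execute(query)]

With the index on value, each lookup is a cheap B-tree probe instead of a scan over all other caches, which is where your nested loops spend their time.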
If you want to stay with pure Python, you could try compiling your script with Cython; you can get a reasonable speed improvement out of it.
http://docs.cython.org/src/tutorial/pure.html
You can then improve your code further with some static type hints.
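As a sketch of what that looks like in Cython's pure-Python mode (the helper function and its arguments below are only illustrative, not taken from your script):

import cython

@cython.locals(appearsin=cython.int, i=cython.int)
def count_appearances(value, datatype, other_caches):
    # Counts in how many other caches an entry carries the same value.
    # When compiled with Cython, 'appearsin' and 'i' become C ints instead
    # of Python objects, which removes much of the interpreter overhead.
    appearsin = 0
    for i in range(len(other_caches)):
        for cmpentry in other_caches[i]:
            if datatype in cmpentry and cmpentry[datatype] == value:
                appearsin += 1
                break
    return appearsin

You would then compile the module (for example with cythonize -i) and import it as usual; uncompiled, the decorator is simply ignored and the code still runs as plain Python.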