Python dictionary too slow for cross-comparison, improvements?

I am currently running into performance issues with Python dictionaries. I have a few huge dicts (up to 30k entries each) and I want to cross-compare those entries: given one entry (its identifier is a key), how many of the other dicts also contain an entry with this key? Right now this takes up to 5 hours on my machine, but it needs to finish in a couple of minutes for my tool to be useful. I have already tried removing entries to make the search more efficient.

all_cachedata is a list containing the lists of those dicts. sources is a list with information about the lists in all_cachedata.

import numpy

appearsin_list = []

# first, get all the cached data
sources = sp.get_sources()
all_cachedata = [0]*len(sources)
for source in sources:
    iscached = source[8]
    sourceid = int(source[0])
    if iscached == "True":
        cachedata, _ = get_local_storage_info(sourceid)
    else:
        cachedata = []
    all_cachedata[sourceid-1] = cachedata

# second, compare cache entries
# iterate over all cached sources
for source in sources:
    sourceid = int(source[0])
    datatype = source[3]
    iscached = source[8]
    if verbose:
        print("Started comparing entries from source " + str(sourceid) +
              " with " + str(len(all_cachedata[sourceid-1])) + " entries.")

    if iscached == "True":
        # iterate over all other cache entries
        # iterate over a copy, because entries are removed from the
        # original list below (removing while iterating skips items)
        for entry in list(all_cachedata[sourceid-1]):
            # print("Comparing source " + str(sourceid) + " with source " + str(cmpsourceid) + ".")
            appearsin = 0
            for cmpsource in sources:
                cmpsourceid = int(cmpsource[0])
                cmpiscached = cmpsource[8]
                # find entries for same potential threat
                if cmpiscached == "True" and len(all_cachedata[cmpsourceid-1]) > 0 and cmpsourceid != sourceid:
                    for cmpentry in all_cachedata[cmpsourceid-1]:
                        if datatype in cmpentry:
                            if entry[datatype] == cmpentry[datatype]:
                                appearsin += 1
                                all_cachedata[cmpsourceid-1].remove(cmpentry)
                                break

            appearsin_list.append(appearsin)
            if appearsin > 0:
                if verbose:
                    print(entry[datatype] + " also appears in " + str(appearsin) + " other source(s).")
            all_cachedata[sourceid-1].remove(entry)

avg = float(sum(appearsin_list)) / float(len(appearsin_list))

print ("Average appearance: " + str(avg))
print ("Median: " + str(numpy.median(numpy.array(appearsin_list))))
print ("Minimum: " + str(min(appearsin_list)))
print ("Maximum: " + str(max(appearsin_list)))

I'd appreciate any hints on how to speed this up.

I think your algorithm can be improved; nested loops are not great in this case. I also think that Python is probably not the best choice for this particular purpose: use SQL to compare and search within large amounts of data. You could use something like sqlite_object to convert your dataset into a SQLite database and query it.
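As a rough illustration of that idea, here is a minimal sketch using only the standard library's sqlite3 module instead of sqlite_object. It assumes, for simplicity, that every source uses the same datatype key and that entry[datatype] is the identifier string being cross-compared; adapt the loading step to your real data.

# Sketch: load all identifiers into an in-memory SQLite table, then let
# SQL do the cross-source counting in a single indexed GROUP BY.
import sqlite3

def count_appearances(all_cachedata, datatype):
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE entries (source_id INTEGER, ident TEXT)")
    conn.executemany(
        "INSERT INTO entries VALUES (?, ?)",
        ((sourceid, entry[datatype])
         for sourceid, cachedata in enumerate(all_cachedata, start=1)
         for entry in cachedata
         if datatype in entry))
    # a single index makes the grouped lookup cheap
    conn.execute("CREATE INDEX idx_ident ON entries (ident)")
    # for each identifier: in how many *other* sources does it appear?
    result = dict(conn.execute(
        "SELECT ident, COUNT(DISTINCT source_id) - 1"
        " FROM entries GROUP BY ident").fetchall())
    conn.close()
    return result

This replaces the entry-by-entry nested scan with one indexed aggregation, which is where most of those 5 hours are being spent.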

If you want to keep using pure Python, you can try compiling your script with Cython; you can get a reasonable improvement in speed:

http://docs.cython.org/src/tutorial/pure.html

Then you can improve your code with some static type hints:

http://docs.cython.org/src/tutorial/pure.html#static-typing
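For example, here is a small sketch of the pure-Python typing style from that page; count_matches is a hypothetical helper, not part of the script above. When the module is compiled with Cython, the declared locals become C ints and the loop runs at C speed, while the file still runs as ordinary Python (provided the cython package is installed).

# Sketch of Cython "pure Python" static typing (hypothetical helper).
import cython

@cython.locals(i=cython.int, n=cython.int, hits=cython.int)
def count_matches(values, target):
    n = len(values)
    hits = 0
    for i in range(n):
        if values[i] == target:
            hits += 1
    return hits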