Fastest way to match data from two massive lists with differing data types?

Question

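I have data about a directory structure of unknown (and large) size, and data from Perforce about the same structure. Using Python, I need to be able to match the local data with the Perforce data and produce a list of files that reflects everything in the user's workspace (local directory), including all files missing from Perforce, as well as all data in the depot that is missing from the workspace.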

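Local directory structure data:

I have full control over how I gather the data (currently using os.walk)

Perforce data:

Not much control over how the data is returned
Currently comes as a list of dictionaries
The data returns very quickly regardless of size.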

# This list has hundreds of thousands of entries.
p4data_example = [{'depotFile': '//Path/To/Data/file.extension', 'clientFile': r'X:\Path\To\Data\file.extension', 'isMapped': '', 'headAction': 'add', 'headType': 'text', 'headTime': '00000', 'headRev': '1', 'headChange': '0000', 'headModTime': '00000', 'haveRev': '', 'otherOpen': ['stuff'], 'otherAction': ['move/delete'], 'otherChange': ['00000'], 'otherOpens': '1'}]

I need to operate on the local directory files whether or not there is matching p4 data.

import os

# p4 is assumed to be an already connected P4Python client.
path_to_data = r"X:\Path\To\Data"

p4data = p4.run('fstat', r"%s\..." % path_to_data)

for root, dirs, files in os.walk(path_to_data, topdown=False):
    for file in files:
        file_name = os.path.join(root, file)

        # Linear scan of the whole p4 list for every local file.
        matchingp4 = None
        for p4item in p4data:
            if p4item['clientFile'] == file_name:
                matchingp4 = p4item
                break
        do_stuff_with_data(file_name, matchingp4)

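I am confident this is not the most efficient way to handle this.

The extra time seems to come from:

Gathering all of the local data
Needing to walk the data multiple times to find matches.

I need this to run as fast as possible. Ideally it would finish in a few seconds, but I understand that not knowing how large the data set can get means the run time will vary by an unknown amount.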

Answer 1

Using Python, I need to be able to match the local data with the perforce data and find all of the local files missing from perforce and all of the perforce data that differs from the local data.

(snip)

I am confident this is not the most efficient way to handle this.

Correct. Just run p4 reconcile, and Perforce will do all of this for you automatically. :)
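
As a rough sketch of what that could look like from Python: the helper below previews a reconcile with the -n flag (so nothing is actually opened) and groups the reported files by action. The name preview_reconcile is made up for illustration, and the code assumes that each tagged result is a dict carrying 'action' and 'clientFile' keys; check what your server actually returns before relying on that.

```python
from collections import defaultdict


def preview_reconcile(run, path_spec):
    """Group the files `p4 reconcile -n` reports, keyed by action.

    `run` is any callable shaped like P4Python's `p4.run`: the command
    name followed by its string arguments. (Hypothetical helper.)
    """
    by_action = defaultdict(list)
    for item in run('reconcile', '-n', path_spec):
        # Assumption: tagged results are dicts with 'action' and
        # 'clientFile'. Anything else (plain message strings) is skipped.
        if isinstance(item, dict):
            action = item.get('action', 'unknown')
            by_action[action].append(item.get('clientFile'))
    return dict(by_action)
```

With a connected P4Python client this would be called as preview_reconcile(p4.run, r"X:\Path\To\Data\..."). Passing the runner in as an argument keeps the helper easy to try out against a stub before pointing it at a real server.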

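reconcile does essentially what you are trying to do, but more efficiently: the client walks the local tree and sends the file list to the server, and then, rather than doing an NxN comparison, the server uses the mapping information to request additional client-side checks (i.e. checksums to detect differences) directly, as appropriate for individual files.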

perforce

python-3.x