Is there a performance difference between `dedupe.match(generator=True)` and `dedupe.matchBlocks()` for large datasets?

Question

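I'm about to run deduplication with Python dedupe on a fairly large dataset (400,000 rows). The documentation for the DedupeMatching class lists both a match and a matchBlocks function. For match (with generator=True), the docs suggest using it only on small to moderately sized datasets. From looking through the code, I can't gather how matchBlocks in tandem with block_data performs better on larger datasets than match alone.

I have tried running both approaches on a small dataset (10,000 entities) and didn't notice a difference.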

data_d = {'id1': {'name': 'George Bush', 'address': '123 main st.'},
          'id2': {'name': 'Bill Clinton', 'address': '1600 pennsylvania ave.'},
          # ...
          # 'id10000': {...}
          }

Then either Method A:

blocks = deduper._blockData(data_d)
clustered_dupes = deduper.matchBlocks(blocks, threshold=threshold)

or Method B:

# match takes the raw records and does its own blocking, so pass data_d rather than blocks
clustered_dupes = deduper.match(data_d, threshold=threshold, generator=True)

(The computationally expensive part is then running a for-loop over the clustered_dupes object.)

cluster_membership = {}
for (cluster_id, cluster) in enumerate(clustered_dupes):
    # Do something with each cluster_id like below
    cluster_membership[cluster_id] = cluster

I expect/wonder whether there is a performance difference. If so, could you point me to the code that shows it and explain why?

Answer 1

Calling _blockData and then matchBlocks is no different from just calling match. In fact, if you look at the code, you'll see that match calls those two methods.
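To make the equivalence concrete, here is a toy sketch of that composition. ToyDeduper and its methods are hypothetical stand-ins written for illustration, not dedupe's actual implementation; the blocking rule (first letter of the name) and the "every block is a cluster" matching step are deliberately simplistic assumptions.

```python
class ToyDeduper:
    """Hypothetical stand-in that mimics the match = block + matchBlocks shape."""

    def _block_data(self, data):
        # Group record ids by a crude blocking key: first letter of the name.
        blocks = {}
        for record_id, record in data.items():
            key = record["name"][:1].lower()
            blocks.setdefault(key, []).append(record_id)
        # Only blocks with at least two records can contain duplicates.
        return [sorted(ids) for _, ids in sorted(blocks.items()) if len(ids) > 1]

    def match_blocks(self, blocks):
        # Toy matching step: treat each block as one cluster, yielded lazily.
        for block in blocks:
            yield tuple(block)

    def match(self, data):
        # match is just the two steps chained together.
        return self.match_blocks(self._block_data(data))


toy_data = {
    "id1": {"name": "George Bush"},
    "id2": {"name": "george bush"},
    "id3": {"name": "Bill Clinton"},
}
toy = ToyDeduper()
via_match = list(toy.match(toy_data))
via_blocks = list(toy.match_blocks(toy._block_data(toy_data)))
```

Both paths do the same work in the same order, so any timing difference between them would come from noise rather than from the method chosen.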

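The reason matchBlocks is exposed is that _blockData can take a lot of memory, and you may want to generate the blocks some other way, for example by taking advantage of a relational database.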
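As a rough sketch of the database idea, the generator below streams blocks out of SQLite one at a time instead of building them all in memory. The table layout, the block_key column, and the blocks_from_db helper are all assumptions made for this example; the exact structure that matchBlocks expects for each block depends on your dedupe version, so check its documentation before feeding it anything shaped like this.

```python
import sqlite3
from itertools import groupby


def blocks_from_db(conn):
    """Yield one block at a time; hypothetical helper, not part of dedupe."""
    cursor = conn.execute(
        "SELECT block_key, record_id, name FROM records "
        "ORDER BY block_key, record_id"
    )
    # Rows arrive sorted by block_key, so groupby sees each block contiguously.
    for _, rows in groupby(cursor, key=lambda row: row[0]):
        block = [(record_id, {"name": name}) for _, record_id, name in rows]
        # Singleton blocks cannot contain duplicates, so skip them.
        if len(block) > 1:
            yield block


# Demo with an in-memory database and a precomputed (assumed) blocking key.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (record_id TEXT, name TEXT, block_key TEXT)")
conn.executemany(
    "INSERT INTO records VALUES (?, ?, ?)",
    [
        ("id1", "George Bush", "g"),
        ("id2", "george bush", "g"),
        ("id3", "Bill Clinton", "b"),
    ],
)
demo_blocks = list(blocks_from_db(conn))
```

Only one block is held in Python at a time; the sorting and grouping work is pushed to the database, which is the memory saving the answer refers to.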


python-dedupe