Increase max_components variable in dedupe library

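How can I increase the default value of the max_components variable?

By default, max_components is set to 30000. I need to raise this limit because every time I run deduplication (on the same dataset) I get different results.

I think the total number of clusters in my data is greater than 30000.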

Answer from GitHub

Issue in the dedupe GitHub repository: Increase max_components = 30000

If you are getting different results using the same saved settings file, then what you are reporting is a bug. If you are getting different results from different training data (or even the same training data), that's expected, because at various points dedupe uses a random sample to learn good rules.

In either case, I doubt that max_components is related. But, if you want to change it, fork the code and change it.
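
To illustrate the point about random sampling, here is a minimal sketch using only Python's standard library. It is not dedupe's code: `sample_training_pairs` is a hypothetical helper that stands in for any step that draws a random sample of candidate pairs. It only shows why two runs over the same data can start from different samples, and how a fixed seed makes such a step repeatable. Whether seeding is enough to make a full dedupe run reproducible is not something the answer above confirms.

```python
import random


def sample_training_pairs(records, k, seed=None):
    """Hypothetical helper: draw k random candidate pairs from records.

    With seed=None the generator is seeded from system entropy, so each
    call (and each program run) can return a different sample.
    """
    rng = random.Random(seed)
    # Build every unordered pair of records, then sample k of them.
    pairs = [(a, b) for i, a in enumerate(records) for b in records[i + 1:]]
    return rng.sample(pairs, k)


records = [f"record-{i}" for i in range(50)]

# Same seed: the two samples are identical.
fixed_a = sample_training_pairs(records, 5, seed=42)
fixed_b = sample_training_pairs(records, 5, seed=42)
print(fixed_a == fixed_b)

# No seed: the two samples will usually differ, even on identical data.
unseeded_a = sample_training_pairs(records, 5)
unseeded_b = sample_training_pairs(records, 5)
print(unseeded_a == unseeded_b)
```

If the rules learned downstream depend on which pairs were sampled, different samples can lead to different clusters, independently of any limit such as max_components.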