How to compare given set with available sets to find the one with most intersecting elements, when there are a million fields in total?

Question

The available sets are

A={"one","two","three"}
B={"two","three","four"}
C={"four","five"}

The given set is

D = {"four","five","six"}

The task is to find the available set that shares the most elements with the given set.

Here, C contains 2 elements of D and B contains 1 element of D.
This can be computed by taking the intersection of D with each of A, B and C.
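
For reference, the direct approach described above could be sketched as follows. This is only a baseline under the assumption that the available sets fit in a dictionary keyed by name; the names `available`, `given` and `overlap` are illustrative and not part of the original question.

```python
# Baseline sketch: intersect the given set with every available set.
available = {
    "A": {"one", "two", "three"},
    "B": {"two", "three", "four"},
    "C": {"four", "five"},
}
given = {"four", "five", "six"}

# Size of the intersection with each available set.
overlap = {name: len(members & given) for name, members in available.items()}

# Name of the set with the largest intersection.
best = max(overlap, key=overlap.get)
print(best, overlap[best])
```

Every query touches every available set, which is the cost the question is trying to avoid.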

How can the closest set be found when there are millions of available sets?

Answer 1

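Build a data structure in which each element is a key and the value is the collection of sets that contain that element. For your example, the data structure would look like this: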

"one": {A}
"two": {A,B}
"three": {A,B}
"four": {B,C}
"five": {C}
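
A minimal sketch of how such a mapping might be built in Python, assuming the available sets are held in a dictionary keyed by set name. The helper name `build_index` and the variable names are hypothetical, chosen only for illustration.

```python
from collections import defaultdict

def build_index(available):
    """Map each element to the names of the sets that contain it."""
    index = defaultdict(set)
    for name, members in available.items():
        for element in members:
            index[element].add(name)
    return index

available = {
    "A": {"one", "two", "three"},
    "B": {"two", "three", "four"},
    "C": {"four", "five"},
}
index = build_index(available)
print(index["four"])  # the names B and C, in arbitrary order
```

The mapping is built once, so its cost is paid up front rather than on every query.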

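Now all you need to do is take each element of the input set D and increment a counter for every set name listed under that element. In your example, D is {"four","five","six"}.

So you iterate over "four", "five" and "six":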

Step 1: The counter will be all zeros initially  

Step 2: After looking at the values for "four" the counter will look like below  
B:1, C:1  

Step 3: After looking at the values for "five" the counter will look like below  
B:1, C:2  

Step 4: After looking at the values for "six" the counter will look like below   
B:1, C:2  

Step 5: Choose the set with the maximum value. In this case it will be C.

If you are using Python, you can use the most_common method of collections.Counter.
https://docs.python.org/3/library/collections.html#collections.Counter
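
The counting steps above could be sketched with collections.Counter roughly as follows. The function name `best_match` is hypothetical, and the `index` literal simply restates the element-to-sets mapping from the example; treat this as an illustration rather than a tuned implementation.

```python
from collections import Counter

# Element -> names of the sets containing it (from the example above).
index = {
    "one": {"A"},
    "two": {"A", "B"},
    "three": {"A", "B"},
    "four": {"B", "C"},
    "five": {"C"},
}

def best_match(index, given):
    """Return [(set_name, shared_count)] for the best set, or [] if none overlap."""
    counts = Counter()
    for element in given:
        # Elements missing from the index (such as "six") add nothing.
        counts.update(index.get(element, ()))
    return counts.most_common(1)

print(best_match(index, {"four", "five", "six"}))  # [('C', 2)]
```

Only the sets that share at least one element with D are ever touched, so the work per query depends on the size of D and the lengths of the lists under its elements, not on the total number of available sets.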


python

algorithm

machine-learning

pattern-matching

system-design