为什么 HashSet<T>.IntersectWith 的时间复杂度取决于相等比较器？

Question

我指的是 MSDN (https://msdn.microsoft.com/en-us/library/bb293080(v=vs.110).aspx):

Remarks:

If the collection represented by the other parameter is a HashSet collection with the same equality comparer as the current HashSet object, this method is an O(n) operation. Otherwise, this method is an O(n + m) operation, where n is Count and m is the number of elements in other.

我想了解相等比较器的作用。

如果 other 也是 HashSet，交集可以这样工作：

T[] array = this.ToArray(); // O(1)
foreach (T item in array) // iterate through this => O(n)
    if (!other.Contains(item)) // other is a HashSet => O(1)
        this.Remove(item); // this is a HashSet => O(1)

如 MSDN 所述，总共 O(n)。但据我了解，如果 other 是 HashSet，它应该始终是 O(n) - 不管它有什么相等比较器！

如果 other 不是 HashSet，上面代码片段中 other.Contains 的复杂度会更高（例如 O(log m) 表示 SortedSet或 O(m) 表示 List）。因为我们有嵌套操作，所以我们必须将数字相乘（因此 O(n*log m) 对应 SortedSet 或 O(n*m) 对应 List）以获得总复杂度，这比规定的 O(n+m)。因此，other 不是 HashSet 的情况的方法似乎有所不同。

也许是这样的：

HashSet<T> intersectionSet = new HashSet<T>(this.Comparer); // O(1)
foreach (T item in other) // iterate through other => O(m)
    if (this.Contains(item)) // this is a HashSet => O(1)
        intersectionSet.Add(item); // intersectionSet is a HashSet => O(1)
this.Clear(); // O(n)
foreach (T item in intersectionSet) // O(m) in the worst case, because intersectionSet can have at most m elements
    this.Add(item); // O(1)

所以我们得到 O(m+n) 如 MSDN 所述。同样，我看不出相等比较器在复杂性中扮演什么角色。

由于微软在 designing/implementing IntersectWith 中投入了大量的心思和人力，我相信他们的版本（相等比较器在时间复杂度中发挥作用）是最好的。所以我假设我在推理上犯了一些错误。你能给我指点一下吗？

Answer 1

If other is also a HashSet, intersection could work like this:

T[] array = this.ToArray(); // O(1)
foreach (T item in array) // iterate through this => O(n)
    if (!other.Contains(item)) // other is a HashSet => O(1)  A
        this.Remove(item); // this is a HashSet => O(1)       B

如果两个哈希集使用不同的相等比较器，这将是一个不正确的实现。我用 A 标记的行将使用 other 的相等比较器，我用 B 标记的行将使用 this 的相等比较器。因此，行 other.Contains(item) 检查了错误的事情：它检查 other 是否认为它包含 item。应该检查的是this是否认为other包含item.

但是除了创建数组（这不是 O(1)，并且 Microsoft 可以通过使用 HashSet 的私有字段来避免），你想出的几乎就是你能看到的在 the reference source Microsoft 实际上是在相等比较器匹配的情况下这样做的。

为什么 HashSet<T>.IntersectWith 的时间复杂度取决于相等比较器？

Why does time complexity of HashSet<T>.IntersectWith depend on the equality comparers?

c#

algorithm

hashset

time-complexity

set-intersection