Dictionary 如何使用 Swift 中的 Equatable 协议?

How does Dictionary use the Equatable protocol in Swift?

为了解决 ,我一直在研究实现 Hashable 协议的自定义结构。我试图查看在填充 Dictionary.

时,根据是否存在哈希冲突,等价运算符重载 (==) 被调用了多少次

更新

@matt wrote a much cleaner example of a custom struct that implements the Hashable protocol and shows how often hashValue and == get called. I am copying his code below. To see my original example, check out the edit history.

struct S : Hashable {
    static func ==(lhs:S,rhs:S) -> Bool {
        print("called == for", lhs.id, rhs.id)
        return lhs.id == rhs.id
    }
    let id : Int
    var hashValue : Int {
        print("called hashValue for", self.id)
        return self.id
    }
    init(_ id:Int) {self.id = id}
}
var s = Set<S>()
for i in 1...5 {
    print("inserting", i)
    s.insert(S(i))
}

这会产生结果:

/*
inserting 1
called hashValue for 1
inserting 2
called hashValue for 2
called == for 1 2
called hashValue for 1
called hashValue for 2
inserting 3
called hashValue for 3
inserting 4
called hashValue for 4
called == for 3 4
called == for 1 4
called hashValue for 2
called hashValue for 3
called hashValue for 1
called hashValue for 4
called == for 3 4
called == for 1 4
inserting 5
called hashValue for 5
*/

由于 Hashable 使用 Equatable 来区分散列冲突(无论如何我假设),我希望 func ==() 仅在出现散列冲突时才被调用。但是,在上面@matt 的示例中根本没有散列冲突,但仍在调用 == 。在我的其他强制散列冲突的实验中(参见这个问题的编辑历史),== 似乎被随机调用了几次。

这是怎么回事?

我正在从 bugs.swift.org 复制我的回答。它谈论集合,但细节以相同的方式适用于字典。

In hashed collections, collisions can occur whenever the number of buckets is smaller than the keyspace. When you're creating a new Set without specifying a minimum capacity, the set might have only one bucket, so when you're inserting the second element, a collision occurs. The insert method will then decide if the storage should be grown, using something called a load factor. If the storage was grown, the existing elements have to be migrated over to the new storage buffer. That's when you're seeing all those extra calls to hashValue when inserting 4.

The reason you're still seeing more calls to == than you would expect if the number of buckets is equal or higher to the number of elements has to do with an implementation detail of the bucket index calculation. The bits of the hashValue are mixed or "shuffled" before the modulo operation. This is to cut down on excessive collisions for types with bad hash algorithms.

嗯,这就是你的答案:

https://bugs.swift.org/browse/SR-3330?focusedCommentId=19980&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-19980

What's actually happening:

  • We hash a value only once on insertion.
  • We don't use hashes for comparison of elements, only ==. Using hashes for comparison is only reasonable if you store the hashes, but that means more memory usage for every Dictionary. A compromise that needs evaluation.
  • We try to insert the element before evaluating if the Dictionary can fit that element. This is because the element might already be in the Dictionary, in which case we don't need any more capacity.
  • When we resize the Dictionary, we have to rehash everything, because we didn't store the hashes.

So what you're seeing is:

  • one hash of the search key
  • some =='s (searching for a space)
  • hashes of every element in the collection (resize)
  • one hash of the search key (actually totally wasteful, but not a big deal considering it only happens after an O reallocation)
  • some =='s (searching for a space in the new buffer)

我们都完全错了。他们根本不使用散列 — == — 来决定这是否是一个独特的密钥。然后在集合增长的情况下进行第二轮调用。

虽然问题已经得到解答,但这些答案在评论中提出了一些额外的问题。 @Suragch 问我是否会将我的评论包含在新答案中以帮助其他可能也对 EquatableHashable 之间的关系感到困惑的人。首先,我必须说我对底层机制只有初步的了解,但我会尽力解释我所知道的。

Equatable 是一个相当简单的概念,我们只需查看 Swift 文档即可获得此协议的简洁定义:

Equatable: A type that can be compared for value equality.

如果一个类型是相等的,我们可以用==比较两个实例。简单。

Hashable 是另一回事。当我第一次阅读 Swift 文档中该协议的定义时,我真的笑出了声:

Hashable: A type that can be hashed into a Hasher to produce an integer hash value.

如果您还没有解决这个问题,那么您并不孤单!无论如何,如果 == 用于确定实例是否真正唯一(它需要在集合中或用作字典中的键),为什么我们根本需要 Hashable ? (这是@Suragch 在评论中提出的问题。)

这个问题涉及散列集合(或散列表)(如集合和字典)的基本性质。考虑一下为什么您可能首先选择字典而不是数组。当您需要通过已知索引以外的其他方式查找或引用实例时,您会选择字典,对吗?与数组不同,字典的元素不是按顺序编号的,这使得查找内容变得更加困难。如果我们只有 ==,我们将不得不一个接一个地迭代字典的元素,并且随着字典大小的增加,速度会越来越慢。

这就是散列函数的魔力所在。散列函数(或散列器)将唯一键作为参数,return 是元素的地址。您如何确定它 return 是正确的地址?因为它与最初用于设置该地址的功能相同!创建字典时,它采用每个键(或者更确切地说,每个键的唯一标识属性),根据一些秘密公式将它们混合在一起,并为每个键输出一个数字(哈希值),然后从这些数字中得出每个元素的新索引。稍后,当您想要查找其中一个元素时,哈希器会得到相同的参数,因此它 return 是相同的值。而且因为您所做的只是调用一个函数,所以不涉及迭代并且结果很快,无论集合有多大。

不过有一个问题。没有哈希是完美的。即使您为它提供独特的参数,偶尔它也可能 return 两个完全不同的元素具有相同的哈希值——哈希冲突。当发生这种情况时,它需要检查这两个元素是否真的相同,当然,它通过调用 ==.

来做到这一点

但是在你的例子中,你直接操纵了 hashValue(这是人们在 hash(into:) came along!) and it still called ==! I mean, in theory, it shouldn't need to do this, since there aren't any collisions. But the answer is there in the comment quoted by robinkunde 之前所做的事情:

In hashed collections, collisions can occur whenever the number of buckets is smaller than the keyspace

虽然一般来说,我们不需要担心 Swift 的 built-in 哈希函数的实现细节,但这个特定的细节很重要……在幕后,哈希函数需要另一个参数,这就是集合的大小。如果大小发生变化(就像在遍历范围并将新元素插入集合时重复发生的那样),散列器要么尝试插入比索引槽(或桶)更多的元素,然后发生碰撞,或者它需要使用足够的内存从头开始重新散列所有内容,以便为每个元素提供唯一索引。似乎发生的是两者的结合,正如 comment quoted by matt 所说:

We try to insert the element before evaluating if the Dictionary can fit that element. This is because the element might already be in the Dictionary, in which case we don't need any more capacity.


这就是我对散列集合、散列函数与您的 == 方法之间的关系以及意外行为的原因的更简单解释的尝试。但这一切又给我提出了一个问题……为什么我们需要手动声明 Hashable?难道 Apple 不能采用某种算法来合成所有 Equatable 类型的 Hashable 一致性吗?我的意思是,hash(into:) documentation 表示:

The components used for hashing must be the same as the components compared in your type’s == operator implementation.

如果组件需要相同,难道 Swift 不能仅从我们对 Equatable 的实现中推断出我们的意图吗?我不确定为什么它不能为那些不想更多地控制细节的人提供这样的便利(类似于它提供默认初始化程序的方式)。也许有一天 Swift 会提供这个?不过现在,他们将它们作为单独的关注点保留,Hashable 继承自 Equatable