Python dictionaries when keys are numbers

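I have a question about how dictionaries behave in Python when the keys are numbers. In my example, when I print a dictionary with numeric keys, the output is sorted by key, but in the other case (string keys) the dictionary is unordered. I would like to know whether dictionaries actually follow such a rule.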

l = {"one" : "1", "two" : "2", "three" : "3"}

print(l)

l = {1: "one", 2: "two", 3: "three", 4: "four", 5: "five"}

print(l)

l = {2: "two", 3: "three", 4: "four", 1: "one", 5: "five"}

print(l)

Result:

{'three': '3', 'two': '2', 'one': '1'}

{1: 'one', 2: 'two', 3: 'three', 4: 'four', 5: 'five'}

{1: 'one', 2: 'two', 3: 'three', 4: 'four', 5: 'five'}

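Python stores dictionaries in a hash table, so dictionaries, like other objects built on a hash function, have no defined order. (This applies to the Python versions that produce the output above; since Python 3.7, dicts preserve insertion order, while sets remain unordered.)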
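If a particular order is needed, it is safer to ask for it explicitly than to rely on how the table happens to be laid out. A minimal sketch using the dictionary from the question (the variable names are illustrative):

```python
# The dictionary from the question, with keys deliberately out of order.
d = {2: "two", 3: "three", 4: "four", 1: "one", 5: "five"}

# sorted() orders the (key, value) pairs by key, independent of hash-table layout.
ordered_items = sorted(d.items())
print(ordered_items)
# [(1, 'one'), (2, 'two'), (3, 'three'), (4, 'four'), (5, 'five')]
```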

As for where an item ends up inside the hashed object, Python computes the index with the following code in hashtable.c:

key_hash = ht->hash_func(key);
index = key_hash & (ht->num_buckets - 1);

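Since the hash of a small integer is the integer itself, the index depends directly on the number (ht->num_buckets - 1 is a constant for a given table size): the index is the bitwise AND of (ht->num_buckets - 1) and the number.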
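The masking step can be mimicked in pure Python. This is only a sketch: `bucket_index` and the default table size of 8 are illustrative assumptions, not CPython's API, and a real table also resolves collisions and resizes as it fills.

```python
def bucket_index(key, num_buckets=8):
    """Mimic `key_hash & (num_buckets - 1)` for a power-of-two table size."""
    if num_buckets <= 0 or num_buckets & (num_buckets - 1):
        raise ValueError("num_buckets must be a power of two")
    return hash(key) & (num_buckets - 1)

# Small non-negative ints hash to themselves, so consecutive keys
# land in consecutive slots and come back out in numeric order.
small_int_hashes = [hash(n) for n in range(1, 6)]
small_int_slots = [bucket_index(n) for n in range(1, 6)]
print(small_int_hashes)  # [1, 2, 3, 4, 5]
print(small_int_slots)   # [1, 2, 3, 4, 5]
```

String hashes have no such relationship to the visible key, which is why the string-keyed dictionary in the question prints in a seemingly arbitrary order.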

Consider the following example with a set, which also uses a hash table:

>>> set([0,1919,2000,3,45,33,333,5])
set([0, 33, 3, 5, 45, 333, 2000, 1919])

For the number 33 we have:

33 & (ht->num_buckets - 1) = 1

which is actually:

'0b100001' & '0b111' = '0b1' # 1, the index of 33

Note that in this case (ht->num_buckets - 1) is 8 - 1 = 7, i.e. 0b111.

For 1919:

'0b11101111111' & '0b111' = '0b111' # 7 the index of 1919

For 333:

'0b101001101' & '0b111' = '0b101' # 5 the index of 333
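The same arithmetic can be run for every key in the set. The sketch below uses the 8-bucket mask from the worked examples; the names `MASK`, `slots` and `order_if_32` are illustrative. The 32-bucket variant at the end is purely an assumption (the table may have grown while the eight keys were inserted), not something the source states.

```python
MASK = 8 - 1  # assumed table of 8 buckets, as in the worked examples

keys = [0, 1919, 2000, 3, 45, 33, 333, 5]
slots = {k: k & MASK for k in keys}
for k in keys:
    print("%5d  %-15s -> slot %d" % (k, bin(k), slots[k]))

# Assumption: if the table had grown to 32 buckets, the mask would be 31.
# Sorting by that initial slot gives one plausible iteration order.
order_if_32 = sorted(keys, key=lambda k: k & 31)
print(order_if_32)
```

Keys that share an initial slot (45, 333 and 5 all map to slot 5 under the 8-bucket mask) are placed by collision resolution, so the initial index alone does not determine the final position.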

For more details about Python's hash function, it is worth reading the following quote from the Python source code:

Major subtleties ahead: Most hash schemes depend on having a "good" hash function, in the sense of simulating randomness. Python doesn't: its most important hash functions (for strings and ints) are very regular in common cases:

>>> map(hash, (0, 1, 2, 3))
  [0, 1, 2, 3]
>>> map(hash, ("namea", "nameb", "namec", "named"))
  [-1658398457, -1658398460, -1658398459, -1658398462]

This isn't necessarily bad! To the contrary, in a table of size 2**i, taking the low-order i bits as the initial table index is extremely fast, and there are no collisions at all for dicts indexed by a contiguous range of ints. The same is approximately true when keys are "consecutive" strings. So this gives better-than-random behavior in common cases, and that's very desirable.

OTOH, when collisions occur, the tendency to fill contiguous slices of the hash table makes a good collision resolution strategy crucial. Taking only the last i bits of the hash code is also vulnerable: for example, consider the list [i << 16 for i in range(20000)] as a set of keys. Since ints are their own hash codes, and this fits in a dict of size 2**15, the last 15 bits of every hash code are all 0: they all map to the same table index.

But catering to unusual cases should not slow the usual ones, so we just take the last i bits anyway. It's up to collision resolution to do the rest. If we usually find the key we're looking for on the first try (and, it turns out, we usually do -- the table load factor is kept under 2/3, so the odds are solidly in our favor), then it makes best sense to keep the initial index computation dirt cheap.
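The pathological case mentioned in the quote is easy to check. The sketch below only computes the initial index (the low 15 bits of each hash) and does not model collision resolution; `initial_slots` and `contiguous_slots` are illustrative names.

```python
table_size = 2 ** 15
mask = table_size - 1  # keeps the low 15 bits of a hash

# Keys from the quote: every one is a multiple of 2**16,
# so the low 15 bits of each hash are all zero.
keys = [i << 16 for i in range(20000)]
initial_slots = set(hash(k) & mask for k in keys)
print(len(keys), len(initial_slots))  # 20000 1

# By contrast, a contiguous range of ints spreads out with no collisions.
contiguous_slots = set(hash(k) & mask for k in range(20000))
print(len(contiguous_slots))  # 20000
```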