Python 数据结构内存占用行为异常

Question

我正在尝试编程珍珠之一：

Given a file containing at most ten million 7-digit integers with no duplicates. What is an efficient way to print these numbers in ascending order using just 1.5Mb RAM and reading the data just once? What are the consequences of only having 1Mb of RAM and no other storage? How would your answer change if duplicates were permitted?

为了创建测试用例I，生成了8999999个数字并将它们写入文件。然后对于每一行，我开始将其插入到树中，最后创建一个 trie 结构。

示例代码：

from sys import getsizeof

tree = dict()
xtree = dict()
f = open("data2.txt", "r")
cnt = 0
for number in f:
    cnt += 1
    currTree = tree
    xtree[number] = dict()
    for n in number.strip():
        if n not in currTree:
            currTree[n] = dict()
        currTree = currTree[n]
f.close()

print(cnt)
print(getsizeof(tree))
print(getsizeof(xtree))
print(tree)

样本文件data2.txt有20条记录

生成的树是

现在的问题是，当我对构建的树进行内存大小时，在 20 行时它显示了 240 字节的内存足迹

在第 100 行，树的大小变为 368 字节

并且在 8999999 行也给出了 368 字节

我建立了一个名为 xtree 的辅助地图，它只是提供数据

xtree 和树的大小以字节为单位。

谁能解释一下这是怎么回事..？

Answer 1

您的 tree 只是一个最多包含 10 key-value 对的字典。在更大的树中，不再有 key-value 对。在 key-value 对中的 … 中的值中有更多的值，但字典中仍然只有 10 个 key-value 对。大约 10 key-value 对占用 368 字节的字典似乎是你应该期望的。¹

正如 getsizeof 的文档所说：

Only the memory consumption directly attributed to the object is accounted for, not the memory consumption of objects it refers to.

…

See recursive sizeof recipe for an example of using getsizeof() recursively to find the size of containers and all their contents.

因为你实际上并没有一个完全任意的数据结构，而只是一个 dicts of 等等。而且，虽然你 do 有一些共享引用（例如，如果你读取数字 1234567 而你已经在内存中有一个具有相同值的 int，Python 将重复使用相同的 object)，如果你试图验证你是否适合到 1.5MB，你真的想要 worst-case 测量，所以你可能想跳过 already-seen 值的检查。

所以，如果你愿意，你可以写一些更简单的东西而不是使用那个食谱。但思路是一样的：

def total_dict_size(d):
    size = sys.getsizeof(d)
    if isinstance(d, dict):
        for key, value in d.items():
            size += sys.getsizeof(key) + total_dict_size(value)
    return size

另一方面，您的 xtree 是一个包含 8999999 key-value 对的字典。进行相同的 back-of-the-envelope 计算，我预计它会略低于 300MB。相反，它有点超过 300MB。足够接近了。

并且您还在堆上存储了 8999999 个 7 位整数。取一些漂亮的整数，假设有 5M 个不同的整数不属于少数小值 pre-created 并由 CPython 缓存。这些整数中的每一个都小到足以放入一个 30 位数字，因此它们在 64 位 CPython 上每个占用 28 个字节。所以，如果你在上面调用递归函数tree 或 xtree.

因此，您在 tree、xtree 和实际整数之间的总内存使用量可能约为 750MB，这不太符合 < 1.5MB 的要求.

_{1.每个 Python object 都有一些固定的 header 开销，比如引用计数、指向类型的指针等，加上 type-specific 的东西，比如大多数容器的长度类型。称之为 64 字节。字典然后有一个散列 table。它需要比 10 个插槽大一点，以保持负载远低于 1.0；称之为 13 个插槽。每个槽都需要一个散列值、一个对键的引用和一个对值的引用，因此有 3 个指针或 24 个字节。 64 + 13 * 24 = 376。因此 back-of-the-envelope 计算仅偏离 8 个字节…}

Python 数据结构内存占用行为异常

Python Data Structure memory footprint behaving weird

python

memory-management

programming-pearls