How many bytes are typically loaded into memory when using file.seek() in Python?

Question

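I am currently using a 4 GB file as an open-addressing hash table. To read each offset, I call file.seek() and then read 1 byte (a char) of data. I want to reduce the file size by using buckets (saving space at offsets that hold no data), and to get the best result I would like to know how many bytes are cached in memory when using file.seek(). That way I can tune the buckets so that the file takes less space without increasing the number of disk I/O reads.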

Answer 1

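The file.seek() approach is memory efficient, but it is also slow. Either way, you will want everything aligned to page boundaries, so I suggest that no bucket crosses a 4 KiB boundary.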
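One way to keep a bucket from straddling a page boundary is to pack a whole number of buckets into each page and leave the remainder of the page as padding. The following is only a minimal sketch of that idea: the helper name `bucket_offset` and the 48-byte bucket size are hypothetical and not part of the original question, and the page size defaults to whatever `mmap.PAGESIZE` reports on the current system.

```python
import mmap

BUCKET_SIZE = 48  # hypothetical bucket size in bytes, chosen for illustration


def bucket_offset(index, bucket_size=BUCKET_SIZE, page_size=mmap.PAGESIZE):
    """Return the file offset of bucket `index`.

    Buckets are packed into pages so that none crosses a page boundary;
    any bytes left at the end of a page are unused padding.
    """
    if not 0 < bucket_size <= page_size:
        raise ValueError("bucket_size must be between 1 and page_size")
    buckets_per_page = page_size // bucket_size
    page, slot = divmod(index, buckets_per_page)
    return page * page_size + slot * bucket_size
```

The cost of this layout is the padding at the end of each page (`page_size % bucket_size` bytes), so a bucket size that divides the page size evenly wastes nothing.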

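If you are on a 64-bit processor, do not use file.seek(); use mmap to map the whole file into memory instead. You can then rely on the rule that the page size is usually 4 KiB and align everything on 4 KiB boundaries. This will certainly be faster than faking it with file.seek; it may end up consuming more memory, but the operating system can fine-tune to your access patterns.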

On Python 3, you would use mmap as follows:

import mmap

# provided that your hashtable is in this file
# and its size is 4 GiB
with open("hashtable", "r+b") as f:
    # memory-map the file, size 0 means whole file
    mm = mmap.mmap(f.fileno(), 0)

    # here mm behaves like 4 billion element bytearray
    # that you can read from and write to. changes
    # are flushed to the underlying file.

    # set 1 byte in the file
    mm[123456789] = 42

    # ensure that changes are written to disk
    mm.flush()

    # close the mapping
    mm.close()
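Reading works the same way as writing: slicing the mapping returns a bytes object, and the operating system pages data in as it is touched. Below is a small self-contained sketch that uses a two-page scratch file in place of the 4 GiB table; the helper `read_slot` and the temporary file are assumptions made for this illustration only.

```python
import mmap
import os
import tempfile


def read_slot(mm, offset, length=1):
    """Return `length` bytes starting at `offset` of a memory-mapped file."""
    return mm[offset:offset + length]


# Scratch file standing in for the real hash table file.
fd, path = tempfile.mkstemp()
try:
    with os.fdopen(fd, "r+b") as f:
        # An empty file cannot be mapped, so give it two pages of zeros.
        f.truncate(2 * mmap.PAGESIZE)
        with mmap.mmap(f.fileno(), 0) as mm:
            # Write one byte at the start of the second page, then read it back.
            mm[mmap.PAGESIZE] = 42
            mm.flush()
            value = read_slot(mm, mmap.PAGESIZE)
finally:
    os.remove(path)
```

Using the mapping as a context manager closes it automatically, so the explicit mm.close() call is not needed in this form.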

