PyPy large memory usage compared to CPython

Question

I used Python to solve SPOJ's large input test problem and ran into something very strange. I submitted the same code with PyPy and with Python 2 and compared the results.

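As expected, the code ran much faster with PyPy than with CPython. At the same time, however, the memory usage increased by a whopping factor of 7! I searched the web but could not find any evidence that PyPy uses that much more memory than CPython. Could somebody please explain the huge difference in memory usage?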

I have also considered that my code might be the cause, so I am posting it below:

import io, sys, atexit, os
sys.stdout = io.BytesIO()
atexit.register(lambda: sys.__stdout__.write(sys.stdout.getvalue()))
sys.stdin = io.BytesIO(sys.stdin.read())
raw_input = lambda: sys.stdin.readline().rstrip()

line = list(map(int,raw_input().split()))
num, k = line
ans = 0

for i in xrange(0,num):
    if int(raw_input())%k == 0:
        ans += 1

print(ans)

Could someone please advise me on this?

Answer 1

First, I was not able to reproduce the results. I don't know which versions/set-ups SPOJ uses. For the following experiments, PyPy 5.8.0 and CPython 2.7.12 were used.

As a test case, the largest possible input file, with a size of about 110 MB, was used:

#create_data.py
print 10**7, 33
for i in xrange(10**7):
  print 10**9

>> python create_data.py > input.in

Now running /usr/bin/time -v XXX solution.py < input.in (with XXX being the interpreter, pypy or python) yields:

Interpreter     Maximum resident set size
PyPy            278 MB
CPython         222 MB
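If /usr/bin/time is not available, roughly the same figure can be obtained from Python itself. The following is only a sketch (written for Python 3 on a Unix-like system); the helper name peak_rss_of_script is made up for illustration, and it relies on resource.getrusage, which reports the peak resident set size of child processes that have been waited for.

```python
import resource
import subprocess
import sys


def peak_rss_of_script(script_args, stdin_path=None, interpreter=sys.executable):
    """Run `interpreter script_args...` and return the peak RSS of child processes.

    The value comes from getrusage(RUSAGE_CHILDREN); it is reported in
    kilobytes on Linux and in bytes on macOS. It is the maximum over all
    children waited for so far, so measure one command per process.
    """
    argv = [interpreter] + list(script_args)
    stdin = open(stdin_path, "rb") if stdin_path else None
    try:
        # check_call waits for the child, so its usage is included below.
        subprocess.check_call(argv, stdin=stdin)
    finally:
        if stdin is not None:
            stdin.close()
    return resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
```

A call could look like peak_rss_of_script(["solution.py"], "input.in", interpreter="pypy"), where the file names are placeholders for your own script and input.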

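PyPy needs somewhat more memory (roughly 25% more in this measurement). CPython and PyPy use different garbage collection strategies, and I think PyPy's trade-off is to be faster at the cost of using more memory. The PyPy developers have a great article about their garbage collector and how it compares to CPython's.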

Second, I don't trust the numbers from the SPOJ site. sys.stdin.read() reads the whole file into memory. The Python documentation even says:

To read a file’s contents, call f.read(size), which reads some quantity of data and returns it as a string. size is an optional numeric argument. When size is omitted or negative, the entire contents of the file will be read and returned; it’s your problem if the file is twice as large as your machine’s memory.

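Assuming the worst case is included in their test cases, the memory usage should be at least the size of the file (110 MB), because you use sys.stdin.read(), and even twice that size, because you are copying the data.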
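If you want to keep control over the reads yourself without holding the whole input in memory, the f.read(size) form from the quote can be used in a loop. This is a minimal sketch, not what SPOJ or the original code does; the function name count_divisible_chunked and the default chunk size are assumptions chosen for illustration.

```python
import io


def count_divisible_chunked(stream, k, chunk_size=1 << 16):
    """Count the integers (one per line) in a binary stream that are divisible by k.

    Only one chunk plus a possibly incomplete trailing line is kept in
    memory at any time.
    """
    count = 0
    tail = b""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        lines = (tail + chunk).split(b"\n")
        # The last piece may be an incomplete line; carry it over.
        tail = lines.pop()
        for line in lines:
            if line.strip() and int(line) % k == 0:
                count += 1
    # Handle a final line that is not terminated by a newline.
    if tail.strip() and int(tail) % k == 0:
        count += 1
    return count
```

For a quick check, count_divisible_chunked(io.BytesIO(b"3\n6\n7\n9\n"), 3) counts the three multiples of 3.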

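Actually, I'm not sure the whole trouble is worth it. Using raw_input() would probably be fast enough; I would just trust Python to do the right thing. CPython normally buffers stdout and stdin (fully buffered if they are redirected to files, line-buffered for a console), and you have to use the command line option -u to switch that off.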

But if you really want to be sure, you can use the file-object iterator of sys.stdin, because, as the CPython man page states:

-u Force stdin, stdout and stderr to be totally unbuffered. On systems where it matters, also put stdin, stdout and stderr in binary mode. Note that there is internal buffering in xreadlines(), readlines() and file-object iterators ("for line in sys.stdin") which is not influenced by this option. To work around this, you will want to use "sys.stdin.readline()" inside a "while 1:" loop.

That means your program could look like this:

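import sys

# The first line holds the number of values and the divisor k.
num, k = map(int, raw_input().split())
ans = 0
# Iterate over the remaining lines lazily instead of reading them all at once.
for line in sys.stdin:
    if int(line) % k == 0:
        ans += 1
print(ans)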

This has the big advantage that this variant uses only about 7 MB of memory.

Another lesson: you should not use sys.stdin.readline() if you are afraid that somebody might run your program in unbuffered mode.

Some further experiments (with my CPU clocked down):

Variant (time / peak memory)   CPython         CPython -u      PyPy            PyPy -u
original                       28 s / 221 MB   25 s / 221 MB   3 s / 278 MB    3 s / 278 MB
raw_input()                    29 s / 7 MB     110 s / 7 MB    7 s / 75 MB     100 s / 63 MB
readline()                     38 s / 7 MB     130 s / 7 MB    5 s / 75 MB     100 s / 63 MB
readlines()                    20 s / 560 MB   20 s / 560 MB   4 s / 1.4 GB    4 s / 1.4 GB
file-iterator                  17 s / 7 MB     17 s / 7 MB     4 s / 68 MB     100 s / 62 MB

There are some takeaways:

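raw_input() and sys.stdin.readline() have similar performance.
raw_input() is buffered, but its buffer seems to differ somewhat from that of the file-object iterator, which outperforms raw_input(), at least for this file.
The memory overhead of sys.stdin.readlines() seems to be pretty high, at least as long as the lines are short.
The file-object iterator behaves differently in CPython and PyPy when the option -u is used: for PyPy, -u also switches off the buffering of the file-object iterator (maybe a bug?).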
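The readlines() versus file-iterator difference can be reproduced on a small scale. The sketch below is written for Python 3 (not the Python 2 used in the measurements above) and uses tracemalloc, which only tracks allocations made by the Python interpreter, so its numbers are not comparable with the resident set sizes in the table. All function names and the input size are made up for illustration.

```python
import os
import tempfile
import tracemalloc


def peak_bytes(func, path):
    """Return the peak number of bytes traced by tracemalloc while func(path) runs."""
    tracemalloc.start()
    try:
        func(path)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak


def count_with_readlines(path, k=33):
    # readlines() materialises every line as a separate string in one list.
    with open(path) as f:
        return sum(1 for line in f.readlines() if int(line) % k == 0)


def count_with_iterator(path, k=33):
    # The file iterator hands out one line at a time.
    with open(path) as f:
        return sum(1 for line in f if int(line) % k == 0)


def make_input(n):
    """Write n short lines to a temporary file and return its path."""
    fd, path = tempfile.mkstemp(suffix=".in")
    with os.fdopen(fd, "w") as f:
        for _ in range(n):
            f.write("1000000000\n")
    return path


def compare(n=20000):
    """Return (peak with readlines(), peak with the file iterator) in bytes."""
    path = make_input(n)
    try:
        return (peak_bytes(count_with_readlines, path),
                peak_bytes(count_with_iterator, path))
    finally:
        os.remove(path)
```

With many short lines, the per-string overhead dominates the readlines() variant, which matches the observation that its memory use is far larger than the size of the file.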
