GNU Parallel Memory Leak when Running Python Script

Question

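I have a Python script that uses a function to temporarily download files from a bucket, convert them into an ndarray, and finally save that array (final size ~10 GB) to another bucket.

I need to run this script ~200 times, so I created a shell file, run_reshape.sh, to parallelize the runs. It follows this layout: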

#!/bin/sh
python3 reshape.py 'group_1'
python3 reshape.py 'group_2'
...

I have been trying to parallelize these runs with GNU Parallel as follows:

parallel --jobs 6 --tmpdir scratch/tmp --cleanup < run_reshape.sh

After 2-3 successful runs of the .py script on different cores, I get the following error from GNU Parallel:

parallel: Error: Output is incomplete. Cannot append to buffer file in $TMPDIR. Is the disk full?
parallel: Error: Change $TMPDIR with --tmpdir or use --compress.

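I'm not sure how the disk could be full. When I check free -m after parallel throws the error, I have >120GB of available space on disk.

I have checked .parallel/tmp/ and scratch/tmp/. scratch/tmp/ is empty, and .parallel/tmp/ contains a single 6-byte file. In addition, all of the variables in the Python script live inside a function, and that function is called without assigning its result to a variable. As an added precaution, I also delete them and call gc.collect() at the end of reshape.py.

Any help is greatly appreciated!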

Additional information

Basic outline of reshape.py:

import gc
import sys

import numpy as np
from PIL import Image

# gcs_file_system and file_io are set up elsewhere in the script (not shown here).

# Define reshape function
def reshape_images(arg):
    x_len = 1000

    # Preallocate the output array and fill it with NaN
    new_shape = np.empty((x_len, 2048, 2048), dtype=np.float16)
    new_shape[:] = np.nan

    for n in range(x_len):
        with gcs_file_system.open(arg+str([n])+'.jpg') as file:
            im = Image.open(file)
            # dtype must be the NumPy type itself, not the string 'np.float16'
            np_im = np.array(im, dtype=np.float16)
            new_shape[n] = np_im
            del im
            del np_im

    save_string = f'{arg}.npy'
    np.save(file_io.FileIO(save_string, 'w'), new_shape)
    del new_shape

# Run reshape function
reshape_images(sys.argv[1])

# Clear memory of namespace variables
gc.collect()

Answer 1

I'm not sure how the disk could be full. When I check free -m after parallel throws the error, I have >120GB of available space on disk.

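You need to run df scratch/tmp before GNU Parallel stops.

GNU Parallel opens temporary files in --tmpdir, deletes them immediately, but keeps them open. This avoids the need to clean up the files if GNU Parallel is killed.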
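The effect of "delete immediately, but keep open" can be reproduced in a few lines of Python. This is only a minimal sketch of the general POSIX behavior, not GNU Parallel's own code; the function name unlinked_file_demo and the file name "buffer" are made up for illustration, and it assumes a POSIX system (on Windows, deleting an open file normally fails).

```python
import os
import tempfile


def unlinked_file_demo(num_bytes=1_000_000):
    """Write to a file that has already been deleted but is still open.

    Returns (directory_listing, size_of_open_file_in_bytes).
    """
    with tempfile.TemporaryDirectory() as tmpdir:
        path = os.path.join(tmpdir, "buffer")  # hypothetical file name
        f = open(path, "wb")
        try:
            os.unlink(path)                      # remove the name; the open handle stays valid
            f.write(b"\0" * num_bytes)
            f.flush()
            listing = os.listdir(tmpdir)         # what ls would show for the directory
            size = os.fstat(f.fileno()).st_size  # bytes still held by the deleted file
        finally:
            f.close()                            # the space is released only here
    return listing, size


if __name__ == "__main__":
    listing, size = unlinked_file_demo()
    print("files visible in directory:", listing)
    print("bytes held by the deleted file:", size)
```

The directory listing comes back empty while the open handle still holds data, which is why scratch/tmp can look empty and still be out of space.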

You will most likely find the following situation:

scratch/tmp is full

there are no files in scratch/tmp

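But as soon as GNU Parallel ends, the space is freed.

So if you only look at df after GNU Parallel has finished, you will not see the moment when the disk is full.

In other words: what you see is 100% normal behavior when scratch/tmp is too small.

Try setting --tmpdir to a directory with more free space.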

Or try:

seq 100000000 | parallel -uj1 -N0 df scratch/tmp

while your jobs are running, and watch the disk fill up.
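If you would rather sample the free space from Python than loop df through parallel, something like the following might do. It is a rough sketch using shutil.disk_usage from the standard library; the function name watch_free_space and its parameters are hypothetical, and the path to watch is whatever you pass to --tmpdir.

```python
import shutil
import time


def watch_free_space(path, samples=5, interval=1.0):
    """Sample free space on the filesystem that contains `path`.

    Prints each reading and returns the list of free-byte values.
    """
    readings = []
    for i in range(samples):
        usage = shutil.disk_usage(path)  # named tuple: total, used, free (in bytes)
        readings.append(usage.free)
        print(f"{path}: {usage.free / 1e9:.2f} GB free of {usage.total / 1e9:.2f} GB")
        if i < samples - 1:
            time.sleep(interval)
    return readings


# Example (run in a second terminal while parallel is working):
# watch_free_space("scratch/tmp", samples=600, interval=1.0)
```

Because the numbers come from the filesystem rather than from a directory listing, space held by deleted-but-open files is counted, so the free value should drop while the jobs run and recover once GNU Parallel exits.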
