MPI4PY shared memory - memory usage spike on access

I am sharing a large numpy array between processes with mpi4py (write once, read many), using a shared-memory window. I find I can set up the shared array without problems, but if I try to access the array on any process other than the leader, my memory usage climbs beyond reasonable limits. Here is a simple snippet that illustrates the application:

from mpi4py import MPI
import numpy as np
import time
import sys

shared_comm = MPI.COMM_WORLD.Split_type(MPI.COMM_TYPE_SHARED)

is_leader = shared_comm.rank == 0

# Set up a large array as example
_nModes = 45
_nSamples = 512*5

float_size = MPI.DOUBLE.Get_size()

size = (_nModes, _nSamples, _nSamples)
if is_leader:
    total_size = np.prod(size)
    nbytes = total_size * float_size
else:
    nbytes = 0

# Create the shared memory, or get a handle based on shared communicator
win = MPI.Win.Allocate_shared(nbytes, float_size, comm=shared_comm)
# Construct the array
buf, itemsize = win.Shared_query(0)
_storedZModes = np.ndarray(buffer=buf, dtype='d', shape=size)

# Fill the shared array with only the leader rank
if is_leader:
    _storedZModes[...] = np.ones(size)

shared_comm.Barrier()

# Access the array - if we don't do this, then memory usage is as expected. If I do this, then I find that memory usage goes up to twice the size, as if it's copying the array on access
if shared_comm.rank == 1:
    # Do a (bad) explicit sum to make clear it is not a copy problem within numpy sum()
    SUM = 0.
    for i in range(_nModes):
        for j in range(_nSamples):
            for k in range(_nSamples):
                SUM = SUM + _storedZModes[i,j,k]

# Wait for a while to make sure slurm notices any issues before finishing
time.sleep(500)

With the above setup, the shared array should take about 2.3 GB, which is confirmed when running the code and querying it. If I submit this through slurm to a queue on a single node with 4 cores and 0.75 GB per process, it runs fine only if I do not do the sum. If the sum is done (as shown, or using np.sum or similar), slurm complains that the memory usage has been exceeded. This does not happen if the leader rank does the sum.

With 0.75 GB per process, the total allocated memory is 3 GB, which leaves roughly 0.6 GB for everything other than the shared array. That should clearly be plenty.
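For reference, the size arithmetic in plain Python (8 bytes per C double):

nbytes = 45 * (512 * 5) ** 2 * 8       # 2,359,296,000 bytes
print(nbytes / 1e9, nbytes / 1024**3)  # ~2.36 GB, ~2.20 GiB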

It seems that accessing the memory on any process other than the leader copies it, which would obviously defeat the purpose. Am I doing something wrong?
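A way to watch this from inside the process, independent of slurm (my own sketch; Linux-specific, and note ru_maxrss is in KiB on Linux but bytes on macOS), reusing shared_comm and _storedZModes from the snippet above:

import resource

def peak_rss_gib():
    # Peak resident set size of this process so far (ru_maxrss is in KiB on Linux)
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / (1024.0 ** 2)

print("rank", shared_comm.rank, "peak RSS before touch:", peak_rss_gib(), "GiB")
if shared_comm.rank == 1:
    SUM = _storedZModes.sum()  # first touch maps every shared page into this rank
print("rank", shared_comm.rank, "peak RSS after touch:", peak_rss_gib(), "GiB")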

EDIT

I have played with window fencing and with put/get as below, and I still see the same behavior. If someone runs this and does not reproduce the problem, that is still useful information for me :)

from mpi4py import MPI
import numpy as np
import time
import sys

shared_comm = MPI.COMM_WORLD.Split_type(MPI.COMM_TYPE_SHARED)
print("Shared comm contains: ", shared_comm.Get_size(), " processes")

shared_comm.Barrier()

leader_rank = 0
is_leader = shared_comm.rank == leader_rank

# Set up a large array as example
_nModes = 45
_nSamples = 512*5

float_size = MPI.DOUBLE.Get_size()

print("COMM has ", shared_comm.Get_size(), " processes")

size = (_nModes, _nSamples, _nSamples)
if is_leader:
    total_size = np.prod(size)
    nbytes = total_size * float_size
    print("Expected array size is ", nbytes/(1024.**3), " GB")
else:
    nbytes = 0

# Create the shared memory, or get a handle based on shared communicator

shared_comm.Barrier()
win = MPI.Win.Allocate_shared(nbytes, float_size, comm=shared_comm)
# Construct the array

buf, itemsize = win.Shared_query(leader_rank)
_storedZModes = np.ndarray(buffer=buf, dtype='d', shape=size)

# Fill the shared array with only the leader rank
win.Fence()
if is_leader:
    print("RANK: ", shared_comm.Get_rank() , " is filling the array ")
    #_storedZModes[...] = np.ones(size)
    win.Put(np.ones(size), leader_rank, 0)
    print("RANK: ", shared_comm.Get_rank() , " SUCCESSFULLY filled the array ")
    print("Sum should return ", np.prod(size))
win.Fence()

# Access the array - if we don't do this, then memory usage is as expected. If I do this, then I find that memory usage goes up to twice the size, as if it's copying the array on access
if shared_comm.rank == 1:
    print("RANK: ", shared_comm.Get_rank() , " is querying the array "); sys.stdout.flush()
    # Do a (bad) explicit sum to make clear it is not a copy problem within numpy sum()
    SUM = 0.
    counter = -1; tSUM = np.empty((1,))
    for i in range(_nModes):
        for j in range(_nSamples):
            for k in range(_nSamples):
                if counter%10000 == 0:
                    print("Finished iteration: ", counter); sys.stdout.flush()
                counter += 1; win.Get(tSUM, leader_rank, counter); SUM += tSUM[0];
                #SUM = SUM + _storedZModes[i,j,k]                                                                                         

    print("RANK: ", shared_comm.Get_rank() , " SUCCESSFULLY queried the array ", SUM)

shared_comm.Barrier()

# Wait for a while to make sure slurm notices any issues before finishing
time.sleep(500)
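For completeness: the element-by-element Put/Get above goes through the MPI layer once per double, which is extremely slow. My understanding is that for a window allocated over MPI.COMM_TYPE_SHARED the usual pattern is direct load/store plus synchronization, along these lines (a sketch reusing the names from the snippet above):

win.Fence()
if is_leader:
    _storedZModes.fill(1.0)   # in-place store into the shared buffer, no temporary
win.Fence()                   # make the leader's stores visible to the other ranks

if shared_comm.rank == 1:
    print("RANK: ", shared_comm.Get_rank(), " sum = ", float(_storedZModes.sum()))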

Answer

Further investigation showed clearly that the problem was in slurm: a switch that effectively tells slurm to ignore shared memory had been turned off, and turning it on solved the problem.

A description of why this causes a problem is given in the accepted answer. In essence, slurm was counting the total resident memory of both processes.
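I cannot swear this is the exact switch on every installation, but on a stock slurm install the knob matching this description is JobAcctGatherParams in slurm.conf (my assumption; check your slurm version's documentation):

# slurm.conf - my best guess at the switch described above
# NoShared: exclude shared memory from the accounted resident memory
# UsePss:   account the proportional set size instead of RSS, so each shared
#           page is charged once, split across the processes that map it
JobAcctGatherParams=UsePss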

I ran this with two MPI tasks and monitored them with top and pmap.

These tools show that

_storedZModes[...] = np.ones(size)

does allocate a buffer full of ones, so the memory needed by the leader really is 2 * nbytes (its resident memory is 2 * nbytes, of which nbytes is in shared memory).

From top:

top - 15:14:54 up 43 min,  4 users,  load average: 2.76, 1.46, 1.18
Tasks:   2 total,   1 running,   1 sleeping,   0 stopped,   0 zombie
%Cpu(s): 27.5 us,  6.2 sy,  0.0 ni, 66.2 id,  0.0 wa,  0.0 hi,  0.1 si,  0.0 st
KiB Mem :  3881024 total,   161624 free,  2324936 used,  1394464 buff/cache
KiB Swap:   839676 total,   818172 free,    21504 used.  1258976 avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 6390 gilles    20   0 2002696  20580   7180 R 100.0  0.5   1:00.39 python
 6389 gilles    20   0 3477268   2.5g   1.1g D  12.3 68.1   0:02.41 python

Once this operation completes, the buffer filled with ones is freed and the memory drops back to nbytes (resident memory ~= shared memory).

Note that at this point, both resident and shared memory on task 1 are still very small.

top - 15:14:57 up 43 min,  4 users,  load average: 2.69, 1.47, 1.18
Tasks:   2 total,   1 running,   1 sleeping,   0 stopped,   0 zombie
%Cpu(s): 27.2 us,  1.3 sy,  0.0 ni, 71.3 id,  0.0 wa,  0.0 hi,  0.1 si,  0.0 st
KiB Mem :  3881024 total,  1621860 free,   848848 used,  1410316 buff/cache
KiB Swap:   839676 total,   818172 free,    21504 used.  2735168 avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 6390 gilles    20   0 2002696  20580   7180 R 100.0  0.5   1:03.39 python
 6389 gilles    20   0 2002704   1.1g   1.1g S   2.0 30.5   0:02.47 python

When the sum is computed on task 1, both resident and shared memory rise to nbytes there as well:

top - 15:18:09 up 46 min,  4 users,  load average: 0.33, 1.01, 1.06
Tasks:   2 total,   0 running,   2 sleeping,   0 stopped,   0 zombie
%Cpu(s):  8.4 us,  2.9 sy,  0.0 ni, 88.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem :  3881024 total,  1297172 free,   854460 used,  1729392 buff/cache
KiB Swap:   839676 total,   818172 free,    21504 used.  2729768 avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 6389 gilles    20   0 2002704   1.1g   1.1g S   0.0 30.6   0:02.48 python
 6390 gilles    20   0 2002700   1.4g   1.4g S   0.0 38.5   2:34.42 python

At the end, top reports roughly nbytes of resident memory for each of the two processes, which corresponds to a single mapping of the same nbytes of shared memory.
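You can also ask the kernel directly how it attributes those pages: /proc/<pid>/smaps_rollup (Linux 4.14 and later) reports both Rss and Pss, and Pss charges each shared page once, split among the processes that map it. A small helper of my own to read it (an addition, not part of the test above):

def mem_report(pid="self"):
    # Parse Rss and Pss (both reported in kB) out of /proc/<pid>/smaps_rollup
    fields = {}
    with open("/proc/%s/smaps_rollup" % pid) as f:
        for line in f:
            key, _, rest = line.partition(":")
            if key in ("Rss", "Pss"):
                fields[key] = int(rest.split()[0])
    return fields

print(mem_report())  # for the shared mapping, Pss is roughly Rss / number_of_processes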

I do not know how SLURM measures memory consumption. If it accounts for shared memory correctly, then it should be fine (i.e., nbytes allocated). But if it ignores it, it will believe your job allocated 2 * nbytes of (resident) memory, and that might be too much.

Note that if you replace the initialization with

if is_leader:
    for i in range(_nModes):
        for j in range(_nSamples):
            for k in range(_nSamples):
                _storedZModes[i,j,k] = 1

then the temporary buffer full of ones is never allocated, and the peak memory consumption on rank 0 is nbytes instead of 2 * nbytes.
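For the record, the same no-temporary initialization can be had without the (very slow) Python triple loop, since these write in place (a sketch):

if is_leader:
    _storedZModes.fill(1.0)   # in-place fill of the shared buffer, no temporary
    # equivalently: _storedZModes[...] = 1.0 (scalar broadcast, also in place)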