MPI4PY shared memory - memory usage spike on access
I am using shared memory to share a large numpy array (write once, read many) with mpi4py, making use of shared windows. I find that I can set up the shared array without any problem, but if I try to access the array on any process other than the leader, my memory usage grows beyond reasonable limits. I have a simple code snippet that illustrates the application here:
from mpi4py import MPI
import numpy as np
import time
import sys
shared_comm = MPI.COMM_WORLD.Split_type(MPI.COMM_TYPE_SHARED)
is_leader = shared_comm.rank == 0
# Set up a large array as example
_nModes = 45
_nSamples = 512*5
float_size = MPI.DOUBLE.Get_size()
size = (_nModes, _nSamples, _nSamples)
if is_leader:
    total_size = np.prod(size)
    nbytes = total_size * float_size
else:
    nbytes = 0
# Create the shared memory, or get a handle based on shared communicator
win = MPI.Win.Allocate_shared(nbytes, float_size, comm=shared_comm)
# Construct the array
buf, itemsize = win.Shared_query(0)
_storedZModes = np.ndarray(buffer=buf, dtype='d', shape=size)
# Fill the shared array with only the leader rank
if is_leader:
    _storedZModes[...] = np.ones(size)
shared_comm.Barrier()
# Access the array - if we don't do this, then memory usage is as expected. If I do this, then I find that memory usage goes up to twice the size, as if it's copying the array on access
if shared_comm.rank == 1:
    # Do a (bad) explicit sum to make clear it is not a copy problem within numpy sum()
    SUM = 0.
    for i in range(_nModes):
        for j in range(_nSamples):
            for k in range(_nSamples):
                SUM = SUM + _storedZModes[i,j,k]
# Wait for a while to make sure slurm notices any issues before finishing
time.sleep(500)
With the above setup, the shared array should take about 2.3 GB, and this is confirmed when running the code and querying it. If I submit through slurm to a queue with 4 cores on a single node, at 0.75 GB per process, it runs fine only if I do not do the sum. However, if the sum is done (as shown, or using np.sum or similar), slurm complains that the memory usage has been exceeded. This does not happen if the leader rank does the sum.
At 0.75 GB per process, the total allocated memory is 3 GB, which leaves about 0.6 GB for everything other than the shared array. That should clearly be plenty.
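As a quick sanity check of those numbers (my own back-of-the-envelope arithmetic, not output from the code):

# 45 modes x 2560 x 2560 samples of 8-byte doubles
nModes, nSamples, itemsize = 45, 512 * 5, 8
array_bytes = nModes * nSamples * nSamples * itemsize   # 2,359,296,000 bytes
print(array_bytes / 1024.**3)          # ~2.2 GiB (~2.36 GB)
print(4 * 0.75 - array_bytes / 1.e9)   # ~0.64 GB left out of the 3 GB allocation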
It seems that accessing the memory on any process other than the leader copies the memory, which obviously defeats the purpose. Am I doing something wrong?
EDIT
I have played around with window fencing, and with using put/get as below. I still see the same behaviour. If someone runs this and does not reproduce the problem, that would still be useful information for me :)
from mpi4py import MPI
import numpy as np
import time
import sys
shared_comm = MPI.COMM_WORLD.Split_type(MPI.COMM_TYPE_SHARED)
print("Shared comm contains: ", shared_comm.Get_size(), " processes")
shared_comm.Barrier()
leader_rank = 0
is_leader = shared_comm.rank == leader_rank
# Set up a large array as example
_nModes = 45
_nSamples = 512*5
float_size = MPI.DOUBLE.Get_size()
print("COMM has ", shared_comm.Get_size(), " processes")
size = (_nModes, _nSamples, _nSamples)
if is_leader:
    total_size = np.prod(size)
    nbytes = total_size * float_size
    print("Expected array size is ", nbytes/(1024.**3), " GB")
else:
    nbytes = 0
# Create the shared memory, or get a handle based on shared communicator
shared_comm.Barrier()
win = MPI.Win.Allocate_shared(nbytes, float_size, comm=shared_comm)
# Construct the array
buf, itemsize = win.Shared_query(leader_rank)
_storedZModes = np.ndarray(buffer=buf, dtype='d', shape=size)
# Fill the shared array with only the leader rank
win.Fence()
if is_leader:
    print("RANK: ", shared_comm.Get_rank() , " is filling the array ")
    #_storedZModes[...] = np.ones(size)
    win.Put(np.ones(size), leader_rank, 0)
    print("RANK: ", shared_comm.Get_rank() , " SUCCESSFULLY filled the array ")
print("Sum should return ", np.prod(size))
win.Fence()
# Access the array - if we don't do this, then memory usage is as expected. If I do this, then I find that memory usage goes up to twice the size, as if it's copying the array on access
if shared_comm.rank == 1:
    print("RANK: ", shared_comm.Get_rank() , " is querying the array "); sys.stdout.flush()
    # Do a (bad) explicit sum to make clear it is not a copy problem within numpy sum()
    SUM = 0.
    counter = -1; tSUM = np.empty((1,))
    for i in range(_nModes):
        for j in range(_nSamples):
            for k in range(_nSamples):
                if counter%10000 == 0:
                    print("Finished iteration: ", counter); sys.stdout.flush()
                counter += 1; win.Get(tSUM, leader_rank, counter); SUM += tSUM[0];
                #SUM = SUM + _storedZModes[i,j,k]
    print("RANK: ", shared_comm.Get_rank() , " SUCCESSFULLY queried the array ", SUM)
shared_comm.Barrier()
# Wait for a while to make sure slurm notices any issues before finishing
time.sleep(500)
ANSWER
Further investigation showed clearly that the problem was in slurm: a switch that effectively tells slurm to ignore shared memory had been turned off, and turning it on resolved the issue.
A description of why this caused the problem is given in the accepted answer below. Essentially, slurm was counting the total resident memory of both processes.
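For reference, this kind of switch typically lives among SLURM's job accounting options in slurm.conf. The exact setting is not named above, so the lines below are only a guess at what may have been changed, not a record of it:

JobAcctGatherType=jobacct_gather/linux
# Count proportional set size (PSS) instead of raw RSS ...
JobAcctGatherParams=UsePss
# ... or exclude shared memory pages from the accounted usage:
#JobAcctGatherParams=NoShared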
I ran this with two MPI tasks and monitored them with top and pmap.
These tools show that
_storedZModes[...] = np.ones(size)
does allocate a buffer full of 1s, so the memory needed by the leader is indeed 2 * nbytes (its resident memory is 2 * nbytes, of which nbytes is in shared memory).
From top:
top - 15:14:54 up 43 min, 4 users, load average: 2.76, 1.46, 1.18
Tasks: 2 total, 1 running, 1 sleeping, 0 stopped, 0 zombie
%Cpu(s): 27.5 us, 6.2 sy, 0.0 ni, 66.2 id, 0.0 wa, 0.0 hi, 0.1 si, 0.0 st
KiB Mem : 3881024 total, 161624 free, 2324936 used, 1394464 buff/cache
KiB Swap: 839676 total, 818172 free, 21504 used. 1258976 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
6390 gilles 20 0 2002696 20580 7180 R 100.0 0.5 1:00.39 python
6389 gilles 20 0 3477268 2.5g 1.1g D 12.3 68.1 0:02.41 python
Once this operation completes, the buffer filled with 1s is freed and the memory drops back to nbytes (resident memory ~= shared memory).
Note that at this point, both resident and shared memory on task 1 are still very small.
top - 15:14:57 up 43 min, 4 users, load average: 2.69, 1.47, 1.18
Tasks: 2 total, 1 running, 1 sleeping, 0 stopped, 0 zombie
%Cpu(s): 27.2 us, 1.3 sy, 0.0 ni, 71.3 id, 0.0 wa, 0.0 hi, 0.1 si, 0.0 st
KiB Mem : 3881024 total, 1621860 free, 848848 used, 1410316 buff/cache
KiB Swap: 839676 total, 818172 free, 21504 used. 2735168 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
6390 gilles 20 0 2002696 20580 7180 R 100.0 0.5 1:03.39 python
6389 gilles 20 0 2002704 1.1g 1.1g S 2.0 30.5 0:02.47 python
While the sum is computed on task 1, both its resident and shared memory grow up to nbytes.
top - 15:18:09 up 46 min, 4 users, load average: 0.33, 1.01, 1.06
Tasks: 2 total, 0 running, 2 sleeping, 0 stopped, 0 zombie
%Cpu(s): 8.4 us, 2.9 sy, 0.0 ni, 88.7 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 3881024 total, 1297172 free, 854460 used, 1729392 buff/cache
KiB Swap: 839676 total, 818172 free, 21504 used. 2729768 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
6389 gilles 20 0 2002704 1.1g 1.1g S 0.0 30.6 0:02.48 python
6390 gilles 20 0 2002700 1.4g 1.4g S 0.0 38.5 2:34.42 python
In the end, top reports roughly nbytes of resident memory for each of the two processes, but this is essentially a single mapping of the same nbytes in shared memory.
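If you want each rank to report this from inside the program instead of watching top and pmap from the outside, here is a minimal sketch (my addition; it assumes Linux with a kernel recent enough to expose the RssShmem field in /proc/<pid>/status):

def report_memory(tag):
    # Parse this process's /proc/self/status into a dict of fields.
    fields = {}
    with open("/proc/self/status") as f:
        for line in f:
            key, _, value = line.partition(":")
            fields[key] = value.strip()
    # VmRSS is the total resident memory; RssShmem is the part of it backed
    # by shared memory (tmpfs / SysV shm), which is where the MPI window lives.
    print(tag, "VmRSS =", fields.get("VmRSS"), "RssShmem =", fields.get("RssShmem"))

Calling report_memory("rank %d after sum" % shared_comm.rank) right after the loops gives the same picture as the top snapshots above.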
I do not know how SLURM measures memory consumption.
If it accounts for shared memory correctly, then it should be fine (e.g. nbytes allocated overall).
But if it ignores the sharing, it will consider that your job has allocated 2 * nbytes of (resident) memory, which is probably too much.
Note that if you replace the initialization with
if is_leader:
    for i in range(_nModes):
        for j in range(_nSamples):
            for k in range(_nSamples):
                _storedZModes[i,j,k] = 1
then no temporary buffer full of 1s is allocated, and the peak memory consumption on rank 0 is nbytes instead of 2 * nbytes.
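A simpler alternative (my suggestion, not part of the answer above) that avoids the temporary buffer without the slow Python triple loop is to assign a scalar, which NumPy broadcasts and writes directly into the shared window:

if is_leader:
    _storedZModes[...] = 1.0    # scalar broadcast: no nbytes-sized temporary is created
    # equivalently: _storedZModes.fill(1.0)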