Why does Python multiprocessing use more CPU and GPU resources than the specified number of parallel processes?

I use Python multiprocessing to run 8 PyTorch processes in parallel (intended to occupy 8 CPU cores and 8 GPU threads). However, it ends up consuming 48 CPU cores and 24+ GPU threads. Does anyone know how to bring this down to the intended 8 CPU cores and 8 GPU threads?

[htop screenshot]

(py38) [ec2-user@ip current]$ nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.119.03   Driver Version: 450.119.03   CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:00:1B.0 Off |                    0 |
| N/A   47C    P0    28W /  70W |   8050MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla T4            On   | 00000000:00:1C.0 Off |                    0 |
| N/A   50C    P0    29W /  70W |   8962MiB / 15109MiB |     11%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla T4            On   | 00000000:00:1D.0 Off |                    0 |
| N/A   49C    P0    28W /  70W |   9339MiB / 15109MiB |      9%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla T4            On   | 00000000:00:1E.0 Off |                    0 |
| N/A   49C    P0    28W /  70W |   9761MiB / 15109MiB |      3%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     30971      C   python                           1167MiB |
|    0   N/A  N/A     30973      C   python                           1135MiB |
|    0   N/A  N/A     30974      C   python                           1135MiB |
|    0   N/A  N/A     30975      C   python                           1135MiB |
|    0   N/A  N/A     30976      C   python                           1195MiB |
|    0   N/A  N/A     30977      C   python                           1115MiB |
|    0   N/A  N/A     30978      C   python                           1163MiB |
|    1   N/A  N/A     30971      C   python                           1259MiB |
|    1   N/A  N/A     30972      C   python                           1241MiB |
|    1   N/A  N/A     30973      C   python                           1295MiB |
|    1   N/A  N/A     30975      C   python                           1273MiB |
|    1   N/A  N/A     30976      C   python                           1287MiB |
|    1   N/A  N/A     30977      C   python                           1269MiB |
|    1   N/A  N/A     30978      C   python                           1333MiB |
|    2   N/A  N/A     30971      C   python                           1263MiB |
|    2   N/A  N/A     30972      C   python                           1163MiB |
|    2   N/A  N/A     30973      C   python                           1167MiB |
|    2   N/A  N/A     30974      C   python                           1135MiB |
|    2   N/A  N/A     30975      C   python                           1135MiB |
|    2   N/A  N/A     30976      C   python                           1167MiB |
|    2   N/A  N/A     30977      C   python                           1137MiB |
|    2   N/A  N/A     30978      C   python                           1167MiB |
|    3   N/A  N/A     30971      C   python                           1195MiB |
|    3   N/A  N/A     30972      C   python                           1291MiB |
|    3   N/A  N/A     30973      C   python                           1175MiB |
|    3   N/A  N/A     30974      C   python                           1235MiB |
|    3   N/A  N/A     30975      C   python                           1181MiB |
|    3   N/A  N/A     30976      C   python                           1153MiB |
|    3   N/A  N/A     30977      C   python                           1263MiB |
|    3   N/A  N/A     30978      C   python                           1263MiB |
+-----------------------------------------------------------------------------+

Here is the relevant code snippet:

import multiprocessing

def evaluate(id):
    # PyTorch code ...
    pass

p = multiprocessing.Pool(processes=8)
for id in id_list:
    p.apply_async(evaluate, [id])
p.close()
p.join()  # wait for all submitted tasks to finish

Rather than digging into how multiprocessing scheduling and memory release work, I changed the code to loop over the processes instead of the video list, which solved the problem in a controllable and simpler way. Now I can run 64 parallel processes within exactly the specified GPU memory.

import math
import multiprocessing

def evaluate(sub_list):
    for id in sub_list:
        # PyTorch code ...
        pass

multiprocess_num = 8
# Ceiling division so the last chunk picks up the remainder;
# plain int(len / n) would silently drop the trailing items.
batch_size = math.ceil(len(the_list) / multiprocess_num)
p = multiprocessing.Pool(processes=multiprocess_num)
for i in range(multiprocess_num):
    sub_list = the_list[i * batch_size:(i + 1) * batch_size]
    p.apply_async(evaluate, [sub_list])
p.close()
p.join()
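An alternative way to split the list that spreads any remainder evenly across workers is striding. A minimal sketch (the helper name and the 10-item/3-worker example are assumptions for illustration):

```python
def chunk_by_stride(items, num_workers):
    # items[i::num_workers] gives worker i every num_workers-th item,
    # so nothing is dropped and chunk sizes differ by at most one.
    return [items[i::num_workers] for i in range(num_workers)]

chunks = chunk_by_stride(list(range(10)), 3)
print(chunks)  # → [[0, 3, 6, 9], [1, 4, 7], [2, 5, 8]]
```

This matters only when `len(the_list)` is not a multiple of `multiprocess_num`; with contiguous slicing the last worker otherwise gets a shorter (or, with floor division, missing) share.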