Why does Python multiprocessing use more CPUs and GPUs than the specified number of parallel processes?
I use Python multiprocessing to run 8 parallel PyTorch processes (intended for 8 CPU cores and 8 GPU threads). But it ends up consuming 48 CPUs and 24+ GPU threads. Does anyone know how to bring the 48 CPUs and 24+ GPU threads down to 8 CPU cores and 8 GPU threads?
htop screenshot
(py38) [ec2-user@ip current]$ nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.119.03 Driver Version: 450.119.03 CUDA Version: 11.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla T4 On | 00000000:00:1B.0 Off | 0 |
| N/A 47C P0 28W / 70W | 8050MiB / 15109MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 Tesla T4 On | 00000000:00:1C.0 Off | 0 |
| N/A 50C P0 29W / 70W | 8962MiB / 15109MiB | 11% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 Tesla T4 On | 00000000:00:1D.0 Off | 0 |
| N/A 49C P0 28W / 70W | 9339MiB / 15109MiB | 9% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 Tesla T4 On | 00000000:00:1E.0 Off | 0 |
| N/A 49C P0 28W / 70W | 9761MiB / 15109MiB | 3% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 30971 C python 1167MiB |
| 0 N/A N/A 30973 C python 1135MiB |
| 0 N/A N/A 30974 C python 1135MiB |
| 0 N/A N/A 30975 C python 1135MiB |
| 0 N/A N/A 30976 C python 1195MiB |
| 0 N/A N/A 30977 C python 1115MiB |
| 0 N/A N/A 30978 C python 1163MiB |
| 1 N/A N/A 30971 C python 1259MiB |
| 1 N/A N/A 30972 C python 1241MiB |
| 1 N/A N/A 30973 C python 1295MiB |
| 1 N/A N/A 30975 C python 1273MiB |
| 1 N/A N/A 30976 C python 1287MiB |
| 1 N/A N/A 30977 C python 1269MiB |
| 1 N/A N/A 30978 C python 1333MiB |
| 2 N/A N/A 30971 C python 1263MiB |
| 2 N/A N/A 30972 C python 1163MiB |
| 2 N/A N/A 30973 C python 1167MiB |
| 2 N/A N/A 30974 C python 1135MiB |
| 2 N/A N/A 30975 C python 1135MiB |
| 2 N/A N/A 30976 C python 1167MiB |
| 2 N/A N/A 30977 C python 1137MiB |
| 2 N/A N/A 30978 C python 1167MiB |
| 3 N/A N/A 30971 C python 1195MiB |
| 3 N/A N/A 30972 C python 1291MiB |
| 3 N/A N/A 30973 C python 1175MiB |
| 3 N/A N/A 30974 C python 1235MiB |
| 3 N/A N/A 30975 C python 1181MiB |
| 3 N/A N/A 30976 C python 1153MiB |
| 3 N/A N/A 30977 C python 1263MiB |
| 3 N/A N/A 30978 C python 1263MiB |
+-----------------------------------------------------------------------------+
Here is the relevant code snippet:
import multiprocessing

def evaluate(id):  # must be defined before the pool dispatches it
    # PyTorch code ...
    pass

if __name__ == "__main__":
    p = multiprocessing.Pool(processes=8)
    for id in id_list:
        p.apply_async(evaluate, [id])
    p.close()  # no more tasks will be submitted
    p.join()   # wait for all workers to finish
Rather than digging into how multiprocessing schedules processes and releases memory, I switched to looping over the processes instead of over the video list, which solved the problem in a simpler, more controllable way. Now I can run 64 parallel processes with exactly the GPU memory I specify.
import math
import multiprocessing

def evaluate(sub_list):
    for id in sub_list:
        # PyTorch code ...
        pass

if __name__ == "__main__":
    multiprocess_num = 8
    # Ceiling division so trailing items are not dropped when
    # len(the_list) is not a multiple of multiprocess_num.
    batch_size = math.ceil(len(the_list) / multiprocess_num)
    p = multiprocessing.Pool(processes=multiprocess_num)
    for i in range(multiprocess_num):
        sub_list = the_list[i * batch_size:(i + 1) * batch_size]
        p.apply_async(evaluate, [sub_list])
    p.close()
    p.join()
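The nvidia-smi listing above also shows every PID holding memory on all four GPUs, i.e. each worker creates a CUDA context on every device it touches, which is where the "24+ GPU threads" come from. Here is a hedged sketch of pinning each worker to a single GPU, assuming a fork start method where CUDA has not yet been initialized in the parent; gpu_id, NUM_GPUS, and the round-robin split are illustrative, not from the original code:

import os
import multiprocessing

NUM_GPUS = 4  # assumption: the four Tesla T4s shown in nvidia-smi above

def evaluate(sub_list, gpu_id):
    # Restrict visibility before any CUDA initialization in this process,
    # so this worker only ever creates a context on its assigned device.
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    import torch
    device = torch.device("cuda:0")  # index 0 inside the restricted view
    for id in sub_list:
        # PyTorch code using device ...
        pass

if __name__ == "__main__":
    multiprocess_num = 8
    p = multiprocessing.Pool(processes=multiprocess_num)
    for i in range(multiprocess_num):
        sub_list = the_list[i::multiprocess_num]  # round-robin split of the work
        p.apply_async(evaluate, [sub_list, i % NUM_GPUS])
    p.close()
    p.join()

Because CUDA_VISIBLE_DEVICES is applied before CUDA initializes in each worker, every process opens exactly one context, so 8 workers cost 8 contexts instead of one per visible device.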