GluonCV - Use GPU for inference in object detection
I'm using GluonCV for object detection, on Ubuntu 18.04 and with Python.
I retrained the ssd_512_resnet50_v1_custom model on a custom dataset, and I wanted to test inference FPS on a server equipped with GeForce RTX 2080 Ti GPUs (it works fine on my PC's CPU).
So, I'm running:
import time

import cv2
import numpy as np
import mxnet as mx
import gluoncv as gcv
from gluoncv import model_zoo


def main():
    try:
        a = mx.nd.zeros((1,), ctx=mx.gpu(1))
        ctx = [mx.gpu(1)]
    except:
        ctx = [mx.cpu()]

    # -------------------------
    # Load model
    # -------------------------
    classes = ['Guitar', 'face']
    net = model_zoo.get_model('ssd_512_resnet50_v1_custom', ctx=ctx, classes=classes, pretrained_base=False)
    net.load_parameters('saved_weights/test_000/ep_30.params')

    # Load the webcam handler
    cap = cv2.VideoCapture("video/video_01.mp4")

    count_frame = 0

    loading_frame_FPSs = np.zeros(844)
    pre_processing_FPSs = np.zeros(844)
    inference_FPSs = np.zeros(844)
    total_FPSs = np.zeros(844)

    while(True):
        print(f"Frame: {count_frame}")

        total_t_frame = 0

        #######
        start_t = time.time()
        #######
        # Load frame from the camera
        ret, frame = cap.read()
        #######
        stop_t = time.time()
        total_t_frame += (stop_t - start_t)
        FPS = 1/(stop_t-start_t)
        loading_frame_FPSs[count_frame] = FPS
        print(f"\tloading frame time = {(stop_t-start_t)} -> FPS = {FPS}")
        #######

        if (cv2.waitKey(25) & 0xFF == ord('q')) or (ret == False):
            cv2.destroyAllWindows()
            cap.release()
            print("Done!!!")
            break

        #######
        start_t = time.time()
        #######
        # Image pre-processing
        frame = mx.nd.array(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)).astype('uint8')
        rgb_nd, frame = gcv.data.transforms.presets.ssd.transform_test(frame, short=512, max_size=700)
        #######
        stop_t = time.time()
        total_t_frame += (stop_t - start_t)
        FPS = 1/(stop_t-start_t)
        pre_processing_FPSs[count_frame] = FPS
        print(f"\timage pre-processing time = {(stop_t-start_t)} -> FPS = {FPS}")
        #######

        #######
        start_t = time.time()
        #######
        # Run frame through network
        class_IDs, scores, bounding_boxes = net(rgb_nd)
        #######
        stop_t = time.time()
        total_t_frame += (stop_t - start_t)
        FPS = 1/(stop_t-start_t)
        inference_FPSs[count_frame] = FPS
        print(f"\tinference time = {(stop_t-start_t)} -> FPS = {1/(stop_t-start_t)}")
        #######

        print(f"\tTotal frame FPS = {1/total_t_frame}")
        total_FPSs[count_frame] = 1/total_t_frame

        count_frame += 1

    cv2.destroyAllWindows()
    cap.release()

    print(f"Average FPS for:")
    print(f"\tloading frame: {np.average(loading_frame_FPSs)}")
    print(f"\tpre-processing frame: {np.average(pre_processing_FPSs)}")
    print(f"\tinference frame: {np.average(inference_FPSs)}")
    print(f"\ttotal process: {np.average(total_FPSs)}")


if __name__ == "__main__":
    main()
So, basically, I'm measuring the time required for each inference step (loading the frame, pre-processing/resizing, inference), and computing the FPS for each step and for the frame as a whole.
Looking at the output:
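The start/stop timing pattern above is repeated for every step; purely as an illustration (this is a sketch, not part of the original script), the same measurement could be wrapped in a small helper:

import time
from contextlib import contextmanager

@contextmanager
def fps_timer(label, fps_array, frame_idx):
    # Time the enclosed block and record the resulting FPS for this frame
    start = time.time()
    yield
    elapsed = time.time() - start
    fps_array[frame_idx] = 1.0 / elapsed
    print(f"\t{label} time = {elapsed} -> FPS = {1.0 / elapsed}")

# Usage inside the loop, e.g.:
#     with fps_timer("loading frame", loading_frame_FPSs, count_frame):
#         ret, frame = cap.read()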
Average FPS for:
    loading frame: 813.3313447171636
    pre-processing frame: 10.488629638752457
    inference frame: 101.50787170217922
    total process: 9.300166489874748
It seems that the bottleneck is mostly in the pre-processing of the image.
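These numbers are at least roughly consistent with each other: the reported values are averages of per-frame FPS, so their reciprocals only approximately add up, but 1/813.3 + 1/10.49 + 1/101.5 ≈ 0.001 + 0.095 + 0.010 ≈ 0.106 s per frame, i.e. about 9.4 FPS, close to the measured total of ~9.3 FPS, with roughly 90% of the per-frame time spent in pre-processing.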
When checking the output of nvidia-smi, I get:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.56       Driver Version: 418.56       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  Off  | 00000000:18:00.0 Off |                  N/A |
| 36%   63C    P0    79W / 250W |     10MiB / 10989MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 208...  Off  | 00000000:3B:00.0 Off |                  N/A |
| 37%   65C    P2    84W / 250W |    715MiB / 10989MiB |      5%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce RTX 208...  Off  | 00000000:86:00.0 Off |                  N/A |
| 37%   64C    P0    70W / 250W |     10MiB / 10989MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce RTX 208...  Off  | 00000000:AF:00.0 Off |                  N/A |
| 37%   62C    P2   116W / 250W |   2401MiB / 10989MiB |     47%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    1      2955      C   python                                       705MiB |
|    3     15558      C   python                                      2389MiB |
+-----------------------------------------------------------------------------+
I guess this is reasonable, since I'm running inference on one image at a time, so I don't expect the GPU usage to be as high as during training.
At this point, however, there are a couple of things I'm not sure about:
- When reading about the average FPS of SSD models, they are usually reported to be in the range of 25-30 FPS. How do I get to those values? Is it all down to the image pre-processing?
- I tried modifying the block
try:
    a = mx.nd.zeros((1,), ctx=mx.gpu(1))
    ctx = [mx.gpu(1)]
except:
    ctx = [mx.cpu()]
to simply:
ctx = mx.gpu(1)
but in that case the process seems to run on the CPU (not even those 715 MiB of GPU memory get used). Why is that?
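One thing that might help diagnose this (a minimal sketch, assuming the net and rgb_nd variables from the script above) is printing the contexts that the parameters and the input batch actually live on:

# Context(s) of one model parameter (the device(s) it is initialized on)
name, param = next(iter(net.collect_params().items()))
print(name, param.list_ctx())

# Context of the pre-processed input batch
print(rgb_nd.context)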
What about ctx = mx.gpu(0), or ctx = mx.gpu()?
Have you tried the script from the following page? Does it have the same performance? Maybe take your model and try it with the same image pre-processing approach:
https://github.com/dmlc/gluon-cv/blob/master/scripts/detection/demo_webcam_run.py
I wasn't loading the image onto the GPU correctly; I had to add a line before running the inference:
rgb_nd = rgb_nd.as_in_context(ctx)  # move the pre-processed frame to the same context as the model
class_IDs, scores, bounding_boxes = net(rgb_nd)
This increased the GPU memory usage and fixed the problem with the initial context initialization.
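As a side note, instead of probing for a GPU with a dummy array in a try/except, the context can also be chosen directly; a sketch, assuming an MXNet version that provides mx.context.num_gpus():

# Pick the first GPU if one is available, otherwise fall back to the CPU
ctx = mx.gpu(0) if mx.context.num_gpus() > 0 else mx.cpu()

# Model parameters and input batch then both go to the same context
net = model_zoo.get_model('ssd_512_resnet50_v1_custom', ctx=ctx, classes=classes, pretrained_base=False)
rgb_nd = rgb_nd.as_in_context(ctx)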
Moreover, when evaluating the inference speed, I had to add a block that waits for the results to actually be available, and now I'm getting inference frame rates in the 20 FPS range, as expected:
class_IDs, scores, bounding_boxes = net(rgb_nd)
# MXNet executes asynchronously, so block until the outputs have actually been computed
if isinstance(class_IDs, mx.ndarray.ndarray.NDArray):
    class_IDs.wait_to_read()
if isinstance(scores, mx.ndarray.ndarray.NDArray):
    scores.wait_to_read()
if isinstance(bounding_boxes, mx.ndarray.ndarray.NDArray):
    bounding_boxes.wait_to_read()