Tensorflow 对象检测 API RCNN 在 CPU 上很慢：每分钟 1 帧

Question

我正在使用来自 tensorflow 对象检测的本地训练模型 API。我正在使用 faster_rcnn_inception_resnet_v2_atrous_coco_11_06_2017 检查点。我重新训练了一个 1 class 模型并将其导出到 SavedModel

python object_detection/export_inference_graph.py \
    --input_type image_tensor \
    --pipeline_config_path ${PIPELINE_CONFIG_PATH} \
    --trained_checkpoint_prefix /Users/Ben/Dropbox/GoogleCloud/Detection/train/model.ckpt-186\
    --output_directory /Users/Ben/Dropbox/GoogleCloud/Detection/SavedModel/

虽然我知道还有其他更浅的模型，但据报道 RCNN are more than 100x faster 比我看到的要运行次。任何人都可以在 CPU 上使用他们更快的 RCNN 运行时间来证实吗？我正在尝试判断它是否是我的代码的问题，或者只是移动到较小的模型。

我从 juypter notebook 获取代码，几乎没有改动。我运行在一个干净的 virtualenv 中，只安装了要求。

detection_predict.py

import numpy as np
import tensorflow as tf
from PIL import Image
import glob
from collections import defaultdict
from io import StringIO
from matplotlib import pyplot as plt
from PIL import Image
from object_detection.utils import label_map_util
from object_detection.utils import visualization_utils as vis_util
import os
import datetime

TEST_IMAGE_PATHS = glob.glob("/Users/Ben/Dropbox/GoogleCloud/Detection/images/validation/*.jpg")

# Size, in inches, of the output images. ?
IMAGE_SIZE = (12, 8)
NUM_CLASSES = 1

sess=tf.Session()
tf.saved_model.loader.load(sess,[tf.saved_model.tag_constants.SERVING], "/Users/ben/Dropbox/GoogleCloud/Detection/SavedModel/saved_model/")    

label_map = label_map_util.load_labelmap("label.pbtxt")
categories = label_map_util.convert_label_map_to_categories(label_map, max_num_classes=NUM_CLASSES, use_display_name=True)
category_index = label_map_util.create_category_index(categories)

def load_image_into_numpy_array(image):
    (im_width, im_height) = image.size
    npdata=np.array(image.getdata()).reshape((im_height, im_width, 3)).astype(np.uint8)   
    return npdata

# Definite input and output Tensors for sess.graph
image_tensor = sess.graph.get_tensor_by_name('image_tensor:0')

# Each box represents a part of the image where a particular object was detected.
detection_boxes = sess.graph.get_tensor_by_name('detection_boxes:0')

# Each score represent how level of confidence for each of the objects.
# Score is shown on the result image, together with the class label.
detection_scores = sess.graph.get_tensor_by_name('detection_scores:0')
detection_classes = sess.graph.get_tensor_by_name('detection_classes:0')
num_detections = sess.graph.get_tensor_by_name('num_detections:0')
for image_path in TEST_IMAGE_PATHS:

    image = Image.open(image_path)

    #basewidth = 300
    #wpercent = (basewidth/float(image.size[0]))
    #hsize = int((float(image.size[1])*float(wpercent)))
    #image = image.resize((basewidth,hsize), Image.ANTIALIAS)

    # the array based representation of the image will be used later in order to prepare the
    # result image with boxes and labels on it.
    image_np = load_image_into_numpy_array(image)

    # Expand dimensions since the model expects images to have shape: [1, None, None, 3]
    image_np_expanded = np.expand_dims(image_np, axis=0)
    # Actual detection.
    before = datetime.datetime.now()    
    (boxes, scores, classes, num) = sess.run([detection_boxes, detection_scores, detection_classes, num_detections],feed_dict={image_tensor: image_np_expanded})
    print("Prediction took : " + str(datetime.datetime.now() - before))  

    # Visualization of the results of a detection.
    vis_util.visualize_boxes_and_labels_on_image_array(image_np, np.squeeze(boxes), np.squeeze(classes).astype(np.int32), np.squeeze(scores), category_index, use_normalized_coordinates=True,line_thickness=8)
    plt.figure(figsize=IMAGE_SIZE)
    fn=os.path.basename(image_path)
    plt.imsave("/Users/Ben/Dropbox/GoogleCloud/Detection/validation/" + fn,image_np)

产量

(detection) Bens-MacBook-Pro:Detection ben$ python detection_predict.py 

Prediction took : 0:00:51.475269
Prediction took : 0:00:43.955962

调整图像大小没有任何区别（上面已注释掉）。它们并不大 (1280 X 720)。

这是预期的吗？

系统信息

最新的 Tensorflow 版本

Bens-MacBook-Pro:Detection ben$ python
Python 2.7.10 (default, Feb  7 2017, 00:08:15) 
[GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.34)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
>>> tf.__version__
'1.3.0'

编辑 #1

如果有人想知道，从冻结的推理图中进行预测没有任何区别。

detection_graph = tf.Graph()
with detection_graph.as_default():
    od_graph_def = tf.GraphDef()
    with tf.gfile.GFile("/Users/ben/Dropbox/GoogleCloud/Detection/SavedModel/frozen_inference_graph.pb", 'rb') as fid:
        serialized_graph = fid.read()
        od_graph_def.ParseFromString(serialized_graph)
        tf.import_graph_def(od_graph_def, name='')

(detection) Bens-MacBook-Pro:Detection ben$ python detection_predict.py 

Prediction took : 0:01:02.651046
Prediction took : 0:00:43.820992
Prediction took : 0:00:48.805432

cProfile 不是特别有启发性

>>> stats.print_stats(20)
Thu Oct 19 14:55:47 2017    profiling_results

         40742812 function calls (38600273 primitive calls) in 173.800 seconds

   Ordered by: internal time
   List reduced from 4918 to 20 due to restriction <20>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        3  138.345   46.115  138.345   46.115 {_pywrap_tensorflow_internal.TF_Run}
977635/702731    2.852    0.000    9.200    0.000 /Users/ben/Documents/DeepMeerkat/training/Detection/detection/lib/python2.7/site-packages/google/protobuf/internal/python_message.py:469(init)
        3    2.597    0.866    2.597    0.866 {matplotlib._png.write_png}
    10719    2.111    0.000    2.114    0.000 {numpy.core.multiarray.array}
   363351    1.378    0.000    3.216    0.000 /Users/ben/Documents/DeepMeerkat/training/Detection/detection/lib/python2.7/site-packages/google/protobuf/internal/python_message.py:424(MakeSubMessageDefault)
  1045442    1.342    0.000    1.342    0.000 {_weakref.proxy}
562666/310637    1.317    0.000    6.182    0.000 /Users/ben/Documents/DeepMeerkat/training/Detection/detection/lib/python2.7/site-packages/google/protobuf/internal/python_message.py:1211(MergeFrom)
   931022    1.268    0.000    3.113    0.000 /Users/ben/Documents/DeepMeerkat/training/Detection/detection/lib/python2.7/site-packages/google/protobuf/internal/python_message.py:777(ListFields)
789671/269414    1.122    0.000    9.116    0.000 /Users/ben/Documents/DeepMeerkat/training/Detection/detection/lib/python2.7/site-packages/google/protobuf/internal/python_message.py:1008(ByteSize)
  1045442    0.882    0.000    2.498    0.000 /Users/ben/Documents/DeepMeerkat/training/Detection/detection/lib/python2.7/site-packages/google/protobuf/internal/python_message.py:1375(__init__)
3086143/3086140    0.662    0.000    0.756    0.000 {isinstance}
  1427511    0.656    0.000    0.782    0.000 /Users/ben/Documents/DeepMeerkat/training/Detection/detection/lib/python2.7/site-packages/google/protobuf/internal/python_message.py:762(_IsPresent)
   931092    0.649    0.000    0.879    0.000 {method 'sort' of 'list' objects}
1189105/899500    0.599    0.000    0.942    0.000 /Users/ben/Documents/DeepMeerkat/training/Detection/detection/lib/python2.7/site-packages/google/protobuf/internal/python_message.py:1330(Modified)
        1    0.537    0.537    0.537    0.537 {_pywrap_tensorflow_internal.TF_ExtendGraph}
276877/45671    0.480    0.000    8.315    0.000 /Users/ben/Documents/DeepMeerkat/training/Detection/detection/lib/python2.7/site-packages/google/protobuf/internal/python_message.py:1050(InternalSerialize)
  2602117    0.480    0.000    0.480    0.000 {method 'items' of 'dict' objects}
   459805    0.474    0.000    1.336    0.000 /Users/ben/Documents/DeepMeerkat/training/Detection/detection/lib/python2.7/site-packages/google/protobuf/internal/containers.py:551(__getitem__)
        1    0.434    0.434   16.605   16.605 /Users/ben/Documents/DeepMeerkat/training/Detection/detection/lib/python2.7/site-packages/tensorflow/python/framework/importer.py:156(import_graph_def)
  1297794    0.367    0.000    0.367    0.000 {method 'write' of '_io.BytesIO' objects}

编辑#2

在努力推动这一点之后，我开始怀疑那些报告速度更快的人在记录他们的环境方面不够严谨。一些 GPU 检查点在这里供感兴趣的人使用。

https://github.com/tensorflow/models/issues/1715

我将问题悬而未决，希望有人会报告最大模型的 CPU 时间，但我继续认为目前这是正确的并转向较浅的模型.也许这将有助于其他人决定选择哪种模型。

Answer 1

检查下面的回复 link https://medium.com/@vaibhavsahu/hey-ben-3a2ff902303d

我使用的是 nvidia GeForce GTX 1060 6GB GPU。但是，当您运行您的 detection_predict.py （来自 Whosebug）时，它会花费一些时间，因为它每次都会在内存中加载模型。这种情况下的模型会很大，我有 180MB 大小的模型。这就是为什么您必须将模型加载到内存中一次并每次都从加载的模型中进行检测。使用它仅在第一次使用时需要时间。以下检测会更快。您可以使用 jupyter notebook 执行此操作。此外，每次检测使用 with 语句时都会增加检测时间。在给定的笔记本中

with detection_graph.as_default():
  with tf.Session(graph=detection_graph) as sess:

将其更改为，

with detection_graph.as_default():
  sess = tf.Session(graph=detection_graph)

并放入不同的单元格，运行一次然后每次都在另一个单元格中进行检测

# Definite input and output Tensors for detection_graph
image_tensor = detection_graph.get_tensor_by_name('image_tensor:0')
# Each box represents a part of the image where a particular object was detected.
detection_boxes = detection_graph.get_tensor_by_name('detection_boxes:0')
# Each score represent how level of confidence for each of the objects.
# Score is shown on the result image, together with the class label.
detection_scores = detection_graph.get_tensor_by_name('detection_scores:0')
detection_classes = detection_graph.get_tensor_by_name('detection_classes:0')
num_detections = detection_graph.get_tensor_by_name('num_detections:0')
for image_path in TEST_IMAGE_PATHS:
  image = Image.open(image_path)
  # the array based representation of the image will be used later in order to prepare the
  # result image with boxes and labels on it.
  image_np = load_image_into_numpy_array(image)
  # Expand dimensions since the model expects images to have shape: [1, None, None, 3]
  image_np_expanded = np.expand_dims(image_np, axis=0)
  # Actual detection.
  (boxes, scores, classes, num) = sess.run(
      [detection_boxes, detection_scores, detection_classes, num_detections],
      feed_dict={image_tensor: image_np_expanded})
  # Visualization of the results of a detection.
  vis_util.visualize_boxes_and_labels_on_image_array(
      image_np,
      np.squeeze(boxes),
      np.squeeze(classes).astype(np.int32),
      np.squeeze(scores),
      category_index,
      use_normalized_coordinates=True,
      line_thickness=8)
  plt.figure(figsize=IMAGE_SIZE)
  plt.imshow(image_np)

这应该会大大缩短时间。

Answer 2

希望对其他用户选型有所帮助。这是我在 OSX 上报告的 3.1 Ghz CPU 处理器的平均时间（更多信息见上文）。

faster_rcnn_inception_resnet_v2_atrous_coco: 45 sec/image

faster_rcnn_resnet101_coco: 16 sec/image

fcn_resnet101_coco: 7 sec/image

ssd_inception_v2_coco: 0.3 sec/image

ssd_mobilenet_v1_coco: 0.3 sec/image

Answer 3

尝试 Tensorflow Performance Guide（针对 CPU 的一般最佳实践和优化）中的建议可能会有所帮助。具体来说，从源代码安装 TF 并更改输入管道似乎有望提高性能。

此外，Graph Transform Tool 可能值得一试。

我自己还没有尝试过上述方法，但我真的很想知道它们对性能的影响。

Answer 4

在我的 16GB RAM 和 2.5 GHz Intel Core i5 上，仅检测部分需要：

~5 秒/图像 faster_rcnn_resnet101_coco_2018_01_28
~1 秒/图像 ssd_mobilenet_v1_coco_2017_11_17

如果您要循环播放多个图像或运行视频中的帧，请注意 run_inference_for_single_image 方法会为每个图像调用。您可能想删除以下两行并将它们放在某个地方，以便它只被调用一次。

with detection_graph.as_default():
    with tf.Session() as sess:

Answer 5

如果您使用 Tensorflow Object Detection Jupyter Tutorial 中的示例。推理速度慢可能是由于将图像对象转换为 numpy 对象的过程造成的。下面是一个例子来证明这一点：

import numpy as np
from PIL import Image
import time
def load_image_into_numpy_array(image):
  (im_width, im_height) = image.size
  return np.expand_dims(np.array(image.getdata()).reshape((im_height, im_width, 3)).astype(np.uint8),axis=0)
def load_image_into_numpy_array_updated(image):
  return np.expand_dims(np.array(image).astype(np.uint8),axis=0)
if __name__=='__main__':
  image = Image.open('xxx.JPEG')
  # original load method
  s= time.time()
  for _ in range(10):
    y = load_image_into_numpy_array(image)
  e= time.time()
  print('Execution Time of old load method {}'.format((e-s)/10))
 # updated load method
  s= time.time()
  for _ in range(10):
    y = load_image_into_numpy_array_updated(image)
  e= time.time()
  print('Execution Time of updated load method {}'.format((e-s)/10))

结果如下：

Execution Time of old load method 0.4671137571334839
Execution Time of updated load method 0.001219463348388672

外卖np.array(image.getdata())速度极慢。另一种方法是将 PIL 图像对象直接提供给 np.array() 方法，如我的代码示例所示。

另一个加快推理速度的技巧是将TF Session创建代码从推理循环中取出（Session为所有后续推理创建一次）。

PS：我测试用的图片尺寸为1280*720

Answer 6

在 Mac Book Pro 上使用 faster_rcnn_resnet50_fgvc_2018_07_19 处理单个图像需要 8 分钟。

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   2740/1    0.044    0.000  471.234  471.234 {built-in method builtins.exec}
        1    0.319    0.319  471.227  471.227 detect_insect.py:1(<module>)
        1    0.004    0.004  355.473  355.473 detect_insect.py:72(run_inference_for_single_image)
        1    0.001    0.001  352.112  352.112 session.py:846(run)
        1    0.002    0.002  352.111  352.111 session.py:1091(_run)
        1    0.000    0.000  352.096  352.096 session.py:1318(_do_run)
        1    0.000    0.000  352.096  352.096 session.py:1363(_do_call)
        1    0.001    0.001  352.096  352.096 session.py:1346(_run_fn)
        1    0.002    0.002  347.445  347.445 session.py:1439(_call_tf_sessionrun)
        1  347.443  347.443  347.443  347.443 {built-in method _pywrap_tensorflow_internal.TF_SessionRun_wrapper}
        1    0.441    0.441   56.288   56.288 request.py:1775(retrieve)

Tensorflow 对象检测 API RCNN 在 CPU 上很慢：每分钟 1 帧

Tensorflow object detection API RCNN is slow on CPU: 1 frame per min

object-detection

tensorflow