TensorFlow Object Detection API: training is very slow

Question

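I'm currently learning the Google TensorFlow Object Detection API. When I try to retrain a model on the Oxford-IIIT Pet dataset, training is extremely slow.

Here is what I have found so far:

Most of the time only 2% of the GPU is in use.
But CPU utilization is 60%, so it doesn't look like the GPU is being starved of input; otherwise the CPU should be close to 100% utilization.

I'm trying to profile it with the TensorFlow profiler, but I'm a bit pressed for time right now, so any ideas or suggestions would help.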

Answer 1

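From what I can see, the GPU is not being used the way it should be. Have you tried the GPU optimizations described in the TensorFlow performance guide?

https://www.tensorflow.org/performance/performance_guide#optimizing_for_gpu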

Answer 2

I found the problem. It was an input issue: my tfrecord file was somehow corrupted, so the input thread would sometimes hang.
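For anyone who wants to check their own files for this kind of damage, here is a rough sketch that scans a TFRecord file without TensorFlow. It relies on the documented record layout (little-endian uint64 length, masked CRC32C of the length, the payload, masked CRC32C of the payload). The names scan_tfrecord, crc32c and masked_crc32c are my own, not part of any library, and the pure-Python checksum will be slow on large files.

```python
import struct

_CRC32C_POLY = 0x82F63B78  # reflected Castagnoli polynomial


def _make_table():
    table = []
    for i in range(256):
        crc = i
        for _ in range(8):
            crc = (crc >> 1) ^ _CRC32C_POLY if crc & 1 else crc >> 1
        table.append(crc)
    return table


_TABLE = _make_table()


def crc32c(data):
    """Plain CRC32C (Castagnoli) of a bytes object."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc = _TABLE[(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


def masked_crc32c(data):
    """CRC32C with the rotation and offset that the TFRecord format applies."""
    crc = crc32c(data)
    return (((crc >> 15) | (crc << 17)) + 0xA282EAD8) & 0xFFFFFFFF


def scan_tfrecord(path):
    """Return (number_of_good_records, error); error is None for a clean file."""
    count = 0
    with open(path, "rb") as f:
        while True:
            header = f.read(12)  # 8-byte length + 4-byte checksum of the length
            if not header:
                return count, None
            if len(header) < 12:
                return count, "truncated header"
            length, length_crc = struct.unpack("<QI", header)
            if masked_crc32c(header[:8]) != length_crc:
                return count, "bad length checksum"
            body = f.read(length + 4)  # payload + 4-byte checksum of the payload
            if len(body) < length + 4:
                return count, "truncated record"
            (data_crc,) = struct.unpack("<I", body[length:])
            if masked_crc32c(body[:length]) != data_crc:
                return count, "bad data checksum"
            count += 1
```

Calling scan_tfrecord on a path returns how many records were read before the first problem, plus a short description of that problem. The scan stops at the first bad record because the offsets after it can no longer be trusted.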

Answer 3

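There are many possible causes. The most common one is a problem with your record file. You should run a few checks before adding images and their contours (the bounding boxes) to the record file. Some of them are:

Check each image before writing it to the record: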

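import tensorflow as tf  # TF 1.x graph-mode API

def checkJPG(fn):
    """Return True if the file at fn can be read and decoded as a JPEG."""
    with tf.Graph().as_default():
        try:
            image_contents = tf.read_file(fn)
            image = tf.image.decode_jpeg(image_contents, channels=3)
            with tf.Session() as sess:
                # Running the op forces the read and decode; a bad file raises here.
                sess.run(image)
        except Exception:
            # Any read or decode failure is treated as a corrupted file.
            print("Corrupted file: ", fn)
            return False
    return True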
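If decoding every file through TensorFlow is too slow for a first pass, a much cheaper pre-filter is to look at the JPEG start and end markers. This is only a sketch under my own assumptions: has_jpeg_markers is a made-up helper name, it catches truncated or non-JPEG files but not damage in the middle of the stream, and a valid JPEG with trailing bytes after the end marker would be flagged, so it does not replace the full decode check.

```python
def has_jpeg_markers(path):
    """Return True if the file starts with SOI (FF D8) and ends with EOI (FF D9)."""
    with open(path, "rb") as f:
        head = f.read(2)
        f.seek(0, 2)  # jump to the end of the file to learn its size
        if f.tell() < 4:
            return False  # too short to hold both markers
        f.seek(-2, 2)  # last two bytes of the file
        tail = f.read(2)
    return head == b"\xff\xd8" and tail == b"\xff\xd9"
```

Files that fail this test can be dropped right away, and the ones that pass can still go through the slower decode check.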

Also check the width and height of each contour, and make sure no contour touches or crosses the image boundary:

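# Runs inside the loop over one image's contours;
# width and height are the image dimensions in pixels.
boxW = xmax - xmin
boxH = ymax - ymin
# Skip degenerate boxes.
if boxW == 0 or boxH == 0:
    print("...ONE CONTOUR SKIPPED... (boxW | boxH) = 0")
    continue

# Skip boxes with an area below 100 square pixels.
if boxW*boxH < 100:
    print("...ONE CONTOUR SKIPPED... (boxW*boxH) < 100")
    continue

# Skip boxes whose normalized coordinates are not strictly between 0 and 1.
# On Python 2 with integer inputs, divide by float(width) and float(height) instead.
if xmin / width <= 0 or xmax / width <= 0 or ymin / height <= 0 or ymax / height <= 0:
    print("...ONE CONTOUR SKIPPED... (x | y) <= 0")
    continue
if xmin / width >= 1 or xmax / width >= 1 or ymin / height >= 1 or ymax / height >= 1:
    print("...ONE CONTOUR SKIPPED... (x | y) >= 1")
    continue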
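The same checks can be pulled into a small function so they are easy to try on a few boxes before building the record file. This is just a sketch: contour_skip_reason and its min_area parameter are names I made up, and the thresholds simply mirror the snippet above.

```python
def contour_skip_reason(xmin, ymin, xmax, ymax, width, height, min_area=100):
    """Return a short reason if the box should be skipped, or None if it is fine."""
    box_w = xmax - xmin
    box_h = ymax - ymin
    if box_w == 0 or box_h == 0:
        return "zero width or height"
    if box_w * box_h < min_area:
        return "area below min_area"
    # float() keeps the division exact even with integer inputs on Python 2.
    normalized = [
        xmin / float(width),
        xmax / float(width),
        ymin / float(height),
        ymax / float(height),
    ]
    if any(v <= 0 for v in normalized):
        return "coordinate <= 0"
    if any(v >= 1 for v in normalized):
        return "coordinate >= 1"
    return None
```

In the record-writing loop, a contour would then be skipped whenever the function returns something other than None, and the returned reason can go straight into the log message.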

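Another cause is having too much data in the evaluation record file. It is better to put only 10 images in your evaluation record file and change the eval config like this: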

eval_config {
  num_visualizations: 10
  num_examples: 10
  eval_interval_secs: 3000
  max_evals: 1
  use_moving_averages: false
}

