
Memory usage of tensorflow conv2d with large filters

I have a tensorflow model with some relatively large 135 x 135 x 1 x 3 convolution filters. I find that tf.nn.conv2d becomes unusable with filters this large - it attempts to use well over 60GB of memory, at which point I have to kill it. Here is a minimal script to reproduce my error:

import tensorflow as tf
import numpy as np

frames, height, width, channels = 200, 321, 481, 1
filter_h, filter_w, filter_out = 5, 5, 3  # With this, output has shape (200, 317, 477, 3)
# filter_h, filter_w, filter_out = 7, 7, 3  # With this, output has shape (200, 315, 475, 3)
# filter_h, filter_w, filter_out = 135, 135, 3  # With this, output will be smaller than the above with shape (200, 187, 347, 3), but memory usage explodes

images = np.random.randn(frames, height, width, channels).astype(np.float32)

filters = tf.Variable(np.random.randn(filter_h, filter_w, channels, filter_out).astype(np.float32))
images_input = tf.placeholder(tf.float32)
conv = tf.nn.conv2d(images_input, filters, strides=[1, 1, 1, 1], padding="VALID")

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    result = sess.run(conv, feed_dict={images_input: images})

print(result.shape)

First, can anyone explain this behaviour? Why does memory usage blow up with filter size? (Note: I also tried rearranging my dimensions to use a single conv3d instead of a batch of conv2ds, but this had the same problem.)

Second, can anyone suggest a solution other than, say, breaking the operation up into 200 separate single-image convolutions?

Edit: After re-reading the docs on tf.nn.conv2d(), I noticed this in the explanation of how it works:

  1. Flattens the filter to a 2-D matrix with shape [filter_height * filter_width * in_channels, output_channels].
  2. Extracts image patches from the input tensor to form a virtual tensor of shape [batch, out_height, out_width, filter_height * filter_width * in_channels].
  3. For each patch, right-multiplies the filter matrix and the image patch vector.

I had originally taken this simply as a description of the process, but if tensorflow is actually extracting and storing separate filter-sized 'patches' from the image under the hood, then a back-of-the-envelope calculation shows that the intermediate computation involved requires ~130GB in my case, well over the limit that I could test. This might answer my first question, but if so, can anyone explain why TF would do this when I'm still only debugging on a CPU?
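
For reference, the back-of-the-envelope estimate goes roughly like this, assuming the full patch tensor from step 2 above is materialised as float32 in one go; the real kernel may tile the work, so the exact figure depends on what you assume gets held in memory at once, but every variant lands far beyond ordinary RAM:

# Rough size of the "virtual" patches tensor from step 2 of the docs,
# for the 135 x 135 filter case, assuming it is materialised in full as float32.
frames, height, width, channels = 200, 321, 481, 1
filter_h, filter_w = 135, 135

out_h = height - filter_h + 1   # 187 with VALID padding
out_w = width - filter_w + 1    # 347 with VALID padding

patch_size = filter_h * filter_w * channels   # values per patch
num_patches = frames * out_h * out_w          # one patch per output position

print("virtual patches tensor shape: (%d, %d, %d, %d)" % (frames, out_h, out_w, patch_size))
print("approx. %.0f GB as float32" % (num_patches * patch_size * 4 / 1e9))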

As you figured out yourself, this is the reason for the big memory consumption. Tensorflow does this because the filters are usually small, and computing a matrix multiplication is a lot faster than computing a convolution directly.
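
To make this concrete, here is a small sketch (TF 1.x API, as in the question) that expresses the same convolution as "extract patches, then one matrix multiplication", which is essentially what the quoted docs describe; the patches tensor is the intermediate that explodes for large filters. This is only an illustration of the idea, not the actual kernel implementation:

import tensorflow as tf
import numpy as np

# Tiny batch and a small filter so the patches tensor easily fits in RAM.
frames, height, width, channels = 4, 321, 481, 1
filter_h, filter_w, filter_out = 5, 5, 3

images = np.random.randn(frames, height, width, channels).astype(np.float32)
filters = np.random.randn(filter_h, filter_w, channels, filter_out).astype(np.float32)

images_input = tf.placeholder(tf.float32, [None, height, width, channels])

# Step 2 of the docs: one flattened patch per output position,
# shape (batch, out_h, out_w, filter_h * filter_w * channels).
patches = tf.extract_image_patches(images_input,
                                   ksizes=[1, filter_h, filter_w, 1],
                                   strides=[1, 1, 1, 1],
                                   rates=[1, 1, 1, 1],
                                   padding="VALID")

# Steps 1 and 3 of the docs: flatten the filter and right-multiply each patch.
filter_mat = tf.reshape(tf.constant(filters),
                        [filter_h * filter_w * channels, filter_out])
conv_via_matmul = tf.tensordot(patches, filter_mat, axes=1)

# The built-in op, for comparison.
conv_builtin = tf.nn.conv2d(images_input, tf.constant(filters),
                            strides=[1, 1, 1, 1], padding="VALID")

with tf.Session() as sess:
    a, b = sess.run([conv_via_matmul, conv_builtin],
                    feed_dict={images_input: images})
    print(np.allclose(a, b, atol=1e-4))  # the two formulations agree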

"can anyone explain why TF would do this when I'm still only debugging on a CPU?"

You can also use tensorflow without a GPU, so the CPU implementations are not just there for debugging. They are optimised for speed as well, and matrix multiplication is faster on both CPU and GPU.

To make convolutions with large filters feasible, you would have to implement a convolution for large filters in C++ and add it as a new op to tensorflow.