Architecture of VGGnet. What is multi-crop, dense evaluation?

I am reading the VGG16 paper, "Very Deep Convolutional Networks for Large-Scale Image Recognition".

In Section 3.2, Testing, it says that all the fully-connected layers are replaced by convolutional layers:

Namely, the fully-connected layers are first converted to convolutional layers (the first FC layer to a 7 × 7 conv. layer, the last two FC layers to 1 × 1 conv. layers). The resulting fully-convolutional net is then applied to the whole (uncropped) image. The result is a class score map with the number of channels equal to the number of classes, and a variable spatial resolution, dependent on the input image size. Finally, to obtain a fixed-size vector of class scores for the image, the class score map is spatially averaged (sum-pooled)
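
To check my understanding of this conversion, here is how I imagine the trained FC weights would be reinterpreted as convolution kernels (a NumPy sketch of my own; the fc1_w/fc2_w/fc3_w names are hypothetical, and I use random arrays in place of real weights):

import numpy as np

# After the last max-pooling layer the feature map is 7 x 7 x 512 = 25088 values.
fc1_w = np.random.randn(25088, 4096)   # first FC layer: 25088 -> 4096
fc2_w = np.random.randn(4096, 4096)    # second FC layer: 4096 -> 4096
fc3_w = np.random.randn(4096, 1000)    # last FC layer: 4096 -> 1000 classes

# The same weights viewed as conv kernels with shape (kh, kw, in_ch, out_ch):
conv1_w = fc1_w.reshape(7, 7, 512, 4096)   # 7 x 7 conv, 4096 output channels
conv2_w = fc2_w.reshape(1, 1, 4096, 4096)  # 1 x 1 conv
conv3_w = fc3_w.reshape(1, 1, 4096, 1000)  # 1 x 1 conv, one channel per class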

So when predicting on the test set, the architecture of VGG16 (configuration D) would be:

input=(224, 224)
conv2d(64, (3,3))
conv2d(64, (3,3))
Maxpooling(2, 2)
conv2d(128, (3,3))
conv2d(128, (3,3))
Maxpooling(2, 2)
conv2d(256, (3,3))
conv2d(256, (3,3))
conv2d(256, (3,3))
Maxpooling(2, 2)
conv2d(512, (3,3))
conv2d(512, (3,3))
conv2d(512, (3,3))
Maxpooling(2, 2)
conv2d(512, (3,3))
conv2d(512, (3,3))
conv2d(512, (3,3))
Maxpooling(2, 2)
Dense(4096) is replaced by conv2d((7, 7))
Dense(4096) is replaced by conv2d((1, 1))
Dense(1000) is replaced by conv2d((1, 1))
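
In runnable (tf.keras) form, I imagine this test-time network would look something like the sketch below. This is only my reconstruction, not the authors' code, and I have guessed that the converted conv layers keep the FC widths (4096, 4096, 1000):

import tensorflow as tf
from tensorflow.keras import layers, Model

def vgg16_fully_conv(num_classes=1000):
    inp = layers.Input(shape=(None, None, 3))  # no fixed spatial size
    x = inp
    for filters, reps in [(64, 2), (128, 2), (256, 3), (512, 3), (512, 3)]:
        for _ in range(reps):
            x = layers.Conv2D(filters, (3, 3), padding='same', activation='relu')(x)
        x = layers.MaxPooling2D((2, 2))(x)
    # FC layers rewritten as convolutions (Section 3.2):
    x = layers.Conv2D(4096, (7, 7), activation='relu')(x)  # was Dense(4096)
    x = layers.Conv2D(4096, (1, 1), activation='relu')(x)  # was Dense(4096)
    x = layers.Conv2D(num_classes, (1, 1))(x)              # was Dense(1000)
    # class score map -> fixed-size score vector by spatial averaging
    out = layers.GlobalAveragePooling2D()(x)
    return Model(inp, out)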

So is this architecture only used for the test set?

And do the last 3 conv layers all have 1000 channels?

The result is a class score map with the number of channels equal to the number of classes

Since the input size is 224 × 224, the output size after the last max-pooling layer will be (7, 7). Why does the paper say variable spatial resolution? I know multiple scales are used, but the image is cropped to (224, 224) before being fed to the network.

And how does VGG16 end up with a (1000,) vector? What does spatially averaged (sum-pooled) mean here? Does it just add a sum-pooling layer of size (7, 7) to obtain a (1, 1, 1000) array?

the class score map is spatially averaged (sum-pooled)
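
If my reading is right, for a 224 × 224 input the class score map has shape (7, 7, 1000), and the spatial averaging would amount to something like this NumPy sketch:

import numpy as np

score_map = np.random.randn(7, 7, 1000)  # class score map for a 224x224 input
scores = score_map.mean(axis=(0, 1))     # average over the spatial dimensions
assert scores.shape == (1000,)           # one score per class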

Also in Section 3.2, Testing:

Also, multi-crop evaluation is complementary to dense evaluation due to different convolution boundary conditions: when applying a ConvNet to a crop, the convolved feature maps are padded with zeros, while in the case of dense evaluation the padding for the same crop naturally comes from the neighbouring parts of an image (due to both the convolutions and spatial pooling), which substantially increases the overall network receptive field, so more context is captured.

So are multi-crop and dense evaluation only used on the validation set?

Suppose the input size is (256, 256). Multi-crop might take (224, 224) images whose crop centers differ, such as [0:223, 0:223] and [1:224, 1:224]. Is my understanding of multi-crop correct?
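
In code, my understanding of multi-crop would look like the sketch below (my own illustration; for reference, the paper says it uses a regular 5 x 5 grid of crops with horizontal flips, i.e. 50 crops per scale):

import numpy as np

def multi_crop(img, crop=224, grid=5):
    # take a regular grid of crops, plus a horizontal flip of each
    h, w = img.shape[:2]
    ys = np.linspace(0, h - crop, grid).astype(int)
    xs = np.linspace(0, w - crop, grid).astype(int)
    crops = []
    for y in ys:
        for x in xs:
            patch = img[y:y + crop, x:x + crop]
            crops.append(patch)
            crops.append(patch[:, ::-1])  # horizontal flip
    return np.stack(crops)

crops = multi_crop(np.zeros((256, 256, 3)))
print(crops.shape)  # (50, 224, 224, 3)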

What is dense evaluation? I have tried to google it but cannot find relevant results.

The main idea of replacing the dense layers with convolutional layers is to make inference independent of the input image size. Suppose you have images of size (224, 224): the network with FC layers will work fine, but as soon as the image size changes, the network starts throwing a size-mismatch error (which means the network depends on the image size).

So, to solve this, we build a fully convolutional network in which the features live in the channels and the spatial dimensions are averaged away, using an average-pooling layer (or even convolutional strides) down to the shape (channels = number_of_classes, 1, 1). So when you flatten the final result, it becomes number_of_classes = channels × 1 × 1.
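
Here is a tiny toy sketch (not VGG itself, just hypothetical layer sizes) showing the point; the same model produces a fixed-size output for two different input sizes, while a Flatten() + Dense head would raise a shape-mismatch error on the second one:

import numpy as np
from tensorflow.keras import layers, Model

inp = layers.Input(shape=(None, None, 3))            # no fixed spatial size
x = layers.Conv2D(8, (3, 3), padding='same', activation='relu')(inp)
x = layers.Conv2D(10, (1, 1))(x)                     # channels = number of classes
out = layers.GlobalAveragePooling2D()(x)             # average away H and W
model = Model(inp, out)

print(model.predict(np.zeros((1, 224, 224, 3))).shape)  # (1, 10)
print(model.predict(np.zeros((1, 384, 384, 3))).shape)  # (1, 10) -- same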

I have not attached the full code for this, because your question as a whole would require a much more detailed answer that also covers a lot of fundamentals. I encourage you to read up on fully convolutional networks to understand the idea. It is quite simple, and I am 100% sure you will grasp the essence of it.