Single-shot Multibox Detection: usage of variance when encoding data for training
While implementing a Single-Shot Multibox Detector in code, I cannot wrap my head around the concept of "variance".
I was going through this and this repositories.
At training time, the localization input data are the delta-encoded coordinates (Δcx, Δcy, Δw, Δh) of the default boxes (anchor boxes, prior boxes) relative to the ground-truth bounding-box coordinates.
The part I don't understand is where a variance of 0.1 is encoded into Δcx and Δcy, and 0.2 into Δw and Δh.
Why is this necessary? Or rather I should ask, what effect does it have on the training results?
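For reference, this is roughly what the encoding and decoding in question look like (a minimal sketch I wrote for illustration; the function names and box layout are mine, not taken from either repo):

```python
import numpy as np

# Minimal sketch of SSD-style box encoding/decoding with variances.
# Boxes are (cx, cy, w, h); variances = (0.1, 0.2) as in the repos.

def encode(gt, prior, variances=(0.1, 0.2)):
    """Delta-encode a ground-truth box against a default (prior) box."""
    d_cx = (gt[0] - prior[0]) / prior[2] / variances[0]
    d_cy = (gt[1] - prior[1]) / prior[3] / variances[0]
    d_w = np.log(gt[2] / prior[2]) / variances[1]
    d_h = np.log(gt[3] / prior[3]) / variances[1]
    return np.array([d_cx, d_cy, d_w, d_h])

def decode(loc, prior, variances=(0.1, 0.2)):
    """Invert encode(); applied to the network's raw outputs at inference."""
    cx = prior[0] + loc[0] * variances[0] * prior[2]
    cy = prior[1] + loc[1] * variances[0] * prior[3]
    w = prior[2] * np.exp(loc[2] * variances[1])
    h = prior[3] * np.exp(loc[3] * variances[1])
    return np.array([cx, cy, w, h])

gt = np.array([0.52, 0.48, 0.25, 0.30])
prior = np.array([0.50, 0.50, 0.20, 0.20])
assert np.allclose(decode(encode(gt, prior), prior), gt)  # round trip
```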
I also dug into the original Caffe implementation, but I couldn't find much of an explanation there, other than that the values are encoded at training time and re-used for decoding at inference time.
I don't have much of a math background, but any pointers to the underlying mathematical theory, links, etc. are welcome.
Thanks in advance!
There is a thread discussing this in the original Caffe implementation and in one of the repositories I was working on here.
The author of the SSD paper says:
You can think of it as approximating a gaussian distribution for adjusting the prior box. Or you can think of it as scaling the localization gradient. Variance is also used in original MultiBox and Fast(er) R-CNN.
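To make the "scaling the localization gradient" reading concrete, here is a small illustration (my own, not from the paper): with an L1 loss, dividing both the prediction and the target by the variance of 0.1 multiplies the gradient by 10:

```python
import torch
import torch.nn.functional as F

# Illustrative only: applying the variance of 0.1 to both sides of an
# L1 localization loss scales the gradient by 1 / 0.1 = 10.
pred = torch.tensor([0.3], requires_grad=True)
target = torch.tensor([0.5])

F.l1_loss(pred / 0.1, target / 0.1).backward()
print(pred.grad)  # tensor([-10.]); without the division it would be tensor([-1.])
```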
The author of the repo I was working on says:
Probably, the naming comes from the idea, that the ground truth bounding boxes are not always precise, in other words, they vary from image to image probably for the same object in the same position just because human labellers cannot ideally repeat themselves. Thus, the encoded values are some random values, and we want them to have unit variance that is why we divide by some value. Why they are initialized to the values used in the code - I've no idea, probably some empirical estimation by the authors.
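A quick numeric sanity check of that unit-variance reading (purely illustrative; the noise model below is my own assumption, not anything stated by the authors): if the ground-truth centers jitter around the prior with a standard deviation of 0.1 prior-widths, and the log-sizes with a standard deviation of 0.2, then dividing by the variances leaves encoded targets with roughly unit standard deviation:

```python
import numpy as np

rng = np.random.default_rng(0)
prior = np.array([0.5, 0.5, 0.2, 0.2])  # (cx, cy, w, h)
n = 100_000

# Hypothetical labelling noise: std of 0.1 prior-widths on the center,
# std of 0.2 on the log-size.
cx = prior[0] + rng.normal(0.0, 0.1 * prior[2], n)
w = prior[2] * np.exp(rng.normal(0.0, 0.2, n))

d_cx = (cx - prior[0]) / prior[2] / 0.1
d_w = np.log(w / prior[2]) / 0.2
print(d_cx.std(), d_w.std())  # both ~1.0: unit variance after encoding
```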
I was wondering the same thing: why is it always necessary to divide and multiply by a fixed variance? And also, if we regressed directly without the "encoding" and "decoding" step, would it really make that much of a difference to training?