Why does ffmpeg output slightly different RGB values when converting to gbrp and rgb24?

Question

A video stream can be converted to an RGB buffer using either of the following command lines:

ffmpeg -i video.mp4 -frames 1 -color_range pc -f rawvideo -pix_fmt rgb24 output.rgb24
ffmpeg -i video.mp4 -frames 1 -color_range pc -f rawvideo -pix_fmt gbrp output.gbrp

These RGB buffers can then be read, for example with Python and NumPy:

import numpy as np


def load_buffer_gbrp(path, width=1920, height=1080):
    """Load a gbrp 8-bit raw buffer from a file"""
    data = np.frombuffer(open(path, "rb").read(), dtype=np.uint8)
    data_gbrp = data.reshape((3, height, width))
    img_rgb = np.empty((height, width, 3), dtype=np.uint8)
    img_rgb[..., 0] = data_gbrp[2, ...]
    img_rgb[..., 1] = data_gbrp[0, ...]
    img_rgb[..., 2] = data_gbrp[1, ...]
    return img_rgb


def load_buffer_rgb24(path, width=1920, height=1080):
    """Load an rgb24 8-bit raw buffer from a file"""
    data = np.frombuffer(open(path, "rb").read(), dtype=np.uint8)
    img_rgb = data.reshape((height, width, 3))
    return img_rgb


buffer_rgb24 = load_buffer_rgb24("output.rgb24")
buffer_gbrp = load_buffer_gbrp("output.gbrp")

In theory, both outputs should contain the same RGB values (only the memory layout should differ); in practice, this is not the case:

import matplotlib.pyplot as plt

diff = buffer_rgb24.astype(float) - buffer_gbrp.astype(float)
fig, (ax1, ax2, ax3) = plt.subplots(ncols=3, constrained_layout=True, figsize=(12, 2.5))
ax1.imshow(buffer_rgb24)
ax1.set_title("rgb24")
ax2.imshow(buffer_gbrp)
ax2.set_title("gbrp")
im = ax3.imshow(diff[..., 1], vmin=-5, vmax=+5, cmap="seismic")
ax3.set_title("difference (green channel)")
plt.colorbar(im, ax=ax3)
plt.show()

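The difference between the converted frames is larger than chroma subsampling or rounding errors could explain (the differences are around 2-3, whereas rounding errors would be below 1) and, worse still, it appears to be a uniform bias across the whole image.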
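To put a number on that bias rather than reading it off a plot, a small helper can report the per-channel mean and maximum absolute difference. This is a minimal sketch: the name summarize_diff is hypothetical, and the two arrays below are synthetic stand-ins for buffer_rgb24 and buffer_gbrp.

```python
import numpy as np


def summarize_diff(a, b):
    """Return per-channel (mean difference, max absolute difference) of two uint8 RGB images."""
    # Widen to a signed type first so that the subtraction cannot wrap around
    diff = a.astype(np.int16) - b.astype(np.int16)
    mean = diff.mean(axis=(0, 1))
    max_abs = np.abs(diff).max(axis=(0, 1))
    return mean, max_abs


# Synthetic stand-ins: img_b is img_a with the green channel raised by 2
rng = np.random.default_rng(0)
img_a = rng.integers(10, 240, size=(4, 6, 3), dtype=np.uint8)
img_b = img_a.copy()
img_b[..., 1] += 2

mean, max_abs = summarize_diff(img_a, img_b)
print("mean difference per channel:", mean)
print("max |difference| per channel:", max_abs)
```

A mean that is clearly non-zero in one channel points to a systematic offset, while a mean near zero with a small maximum is what plain rounding noise would look like.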

Why is that, and which ffmpeg parameters affect this behaviour?

Answer 1

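The following sent me on a wild chase through various ffmpeg options, none of which is really documented as far as I can tell, so I hope it will be useful to others who are as puzzled as I was by this rather cryptic behaviour.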

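The difference is caused by the default parameters of libswscale, the ffmpeg component responsible for converting from YUV to RGB; in particular, adding the full_chroma_int+bitexact+accurate_rnd flags eliminates the difference between the frames: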

ffmpeg -i video.mp4 -frames 1 -color_range pc -f rawvideo -pix_fmt rgb24 -sws_flags full_chroma_int+bitexact+accurate_rnd output_good.rgb24
ffmpeg -i video.mp4 -frames 1 -color_range pc -f rawvideo -pix_fmt gbrp -sws_flags full_chroma_int+bitexact+accurate_rnd output_good.gbrp
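When several flag combinations need to be compared, it can help to generate these command lines from Python. The sketch below only assembles the argument list from the options already shown above; the helper name build_rawvideo_cmd is hypothetical, and actually running ffmpeg is left to the caller.

```python
def build_rawvideo_cmd(src, dst, pix_fmt, sws_flags=None):
    """Assemble the ffmpeg argument list used above for a single raw frame."""
    cmd = ["ffmpeg", "-i", src, "-frames", "1", "-color_range", "pc",
           "-f", "rawvideo", "-pix_fmt", pix_fmt]
    if sws_flags:
        # libswscale flags are joined with "+" on the command line
        cmd += ["-sws_flags", "+".join(sws_flags)]
    cmd.append(dst)
    return cmd


flags = ["full_chroma_int", "bitexact", "accurate_rnd"]
cmd_default = build_rawvideo_cmd("video.mp4", "output.gbrp", "gbrp")
cmd_good = build_rawvideo_cmd("video.mp4", "output_good.rgb24", "rgb24", flags)
print(" ".join(cmd_default))
print(" ".join(cmd_good))
```

Each list can be passed to subprocess.run(cmd, check=True) on a machine where ffmpeg and the input file are available.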

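Note that various video forums tout these flags (or a subset of them) as "better" without really providing an explanation, which did not satisfy me. For the problem at hand they are indeed better; let's see how.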

First, the new outputs agree with the gbrp output obtained with the default options, which is good news!

buffer_rgb24_good = load_buffer_rgb24("output_good.rgb24")
buffer_gbrp_good = load_buffer_gbrp("output_good.gbrp")

diff1 = buffer_rgb24_good.astype(float) - buffer_gbrp.astype(float)
diff2 = buffer_gbrp_good.astype(float) - buffer_gbrp.astype(float)
fig, (ax1, ax2) = plt.subplots(ncols=2, constrained_layout=True, figsize=(8, 2.5))
ax1.imshow(diff1[..., 1], vmin=-5, vmax=+5, cmap="seismic")
ax1.set_title("rgb24 (new) - gbrp (default)")
im = ax2.imshow(diff2[..., 1], vmin=-5, vmax=+5, cmap="seismic")
ax2.set_title("gbrp (new) - gbrp (default)")
plt.colorbar(im, ax=ax2)
plt.show()

Internally, the ffmpeg source code uses the following functions from libswscale/output.c for the conversion:

yuv2rgb_full_1_c_template (and other variants) for rgb24 with full_chroma_int
yuv2rgb_1_c_template (and other variants) for rgb24 without full_chroma_int
yuv2gbrp_full_X_c (and other variants) for gbrp, independently of full_chroma_int

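An important conclusion is that the full_chroma_int parameter appears to be ignored for the gbrp format but not for rgb24, and that it is the main cause of the uniform bias.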

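Note that for non-rawvideo outputs, ffmpeg can select a supported pixel format depending on the selected output format, so either conversion path might end up being used by default without the user being aware of it.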

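Another question is: are these values correct? In other words, could both outputs be biased in the same way? With the colour-science Python package, we can convert the YUV data to RGB using an implementation other than ffmpeg's to gain more confidence.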

FFmpeg can output raw YUV frames in the native format, which can be decoded as long as you know how they are laid out.

$ ffmpeg -i video.mp4 -frames 1 -f rawvideo -pix_fmt yuv444p output.yuv
...
Output #0, rawvideo, to 'output.yuv':
...
 Stream #0:0(und): Video: rawvideo... yuv444p

This can be read with Python:

def load_buffer_yuv444p(path, width=1920, height=1080):
    """Load a yuv444p 8-bit raw buffer from a file"""
    data = np.frombuffer(open(path, "rb").read(), dtype=np.uint8)
    img_yuv444 = np.moveaxis(data.reshape((3, height, width)), 0, 2)
    return img_yuv444

buffer_yuv = load_buffer_yuv444p("output.yuv")

This can then be converted to RGB:

import colour

rgb_ref = colour.YCbCr_to_RGB(buffer_yuv, colour.WEIGHTS_YCBCR["ITU-R BT.709"], in_bits=8, in_legal=True, in_int=True, out_bits=8, out_legal=False, out_int=True)

...and used as a reference:

diff1 = buffer_rgb24_good.astype(float) - rgb_ref.astype(float)
diff2 = buffer_gbrp_good.astype(float) - rgb_ref.astype(float)
diff3 = buffer_rgb24.astype(float) - rgb_ref.astype(float)
diff4 = buffer_gbrp.astype(float) - rgb_ref.astype(float)
fig, axes = plt.subplots(ncols=2, nrows=2, constrained_layout=True, figsize=(8, 5))
im = axes[0, 0].imshow(diff1[..., 1], vmin=-5, vmax=+5, cmap="seismic")
axes[0, 0].set_title("rgb24 (new) - reference")
im = axes[0, 1].imshow(diff2[..., 1], vmin=-5, vmax=+5, cmap="seismic")
axes[0, 1].set_title("gbrp (new) - reference")
im = axes[1, 0].imshow(diff3[..., 1], vmin=-5, vmax=+5, cmap="seismic")
axes[1, 0].set_title("rgb24 (default) - reference")
im = axes[1, 1].imshow(diff4[..., 1], vmin=-5, vmax=+5, cmap="seismic")
axes[1, 1].set_title("gbrp (default) - reference")
plt.show()

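Some differences remain because of the slightly different interpolation methods and rounding errors, but there is no uniform bias, so the two implementations essentially agree.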

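(Note: in this example the output.yuv file is in yuv444p, produced by ffmpeg with the command line above without performing the full YUV-to-RGB conversion. A more complete test would run all the previous conversions from a single raw YUV frame rather than from a regular video, to better isolate the differences.)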
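As a further sanity check that depends on neither ffmpeg nor colour, the same conversion can be written out by hand. The sketch below assumes 8-bit limited-range ("legal") BT.709 input and full-range RGB output, matching the parameters passed to colour.YCbCr_to_RGB above; the function name is hypothetical.

```python
import numpy as np

# BT.709 luma weights; the green weight follows from the other two
KR, KB = 0.2126, 0.0722
KG = 1.0 - KR - KB


def ycbcr709_limited_to_rgb_full(yuv):
    """Convert an (..., 3) uint8 limited-range BT.709 YCbCr array to full-range uint8 RGB."""
    yuv = yuv.astype(np.float64)
    # Undo the 8-bit limited-range scaling: Y in [16, 235], Cb/Cr in [16, 240]
    y = (yuv[..., 0] - 16.0) / 219.0
    cb = (yuv[..., 1] - 128.0) / 224.0
    cr = (yuv[..., 2] - 128.0) / 224.0
    # Invert Y = KR*R + KG*G + KB*B, Cb = (B - Y) / (2*(1 - KB)), Cr = (R - Y) / (2*(1 - KR))
    r = y + 2.0 * (1.0 - KR) * cr
    b = y + 2.0 * (1.0 - KB) * cb
    g = (y - KR * r - KB * b) / KG
    rgb = np.stack([r, g, b], axis=-1)
    # Round to nearest and clip to the 8-bit full range
    return np.clip(np.rint(rgb * 255.0), 0, 255).astype(np.uint8)


# Limited-range black and white should map to full-range black and white
pixels = np.array([[[16, 128, 128], [235, 128, 128]]], dtype=np.uint8)
rgb_manual = ycbcr709_limited_to_rgb_full(pixels)
print(rgb_manual)
```

Calling the function on buffer_yuv yields a second reference that can be compared against rgb_ref and the ffmpeg outputs in the same way as above.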

Answer 2

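Great analysis so far. Let me try to add some perspective from the swscale side, which will hopefully help explain the differences you are seeing and where they come from technically.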

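The differences you are seeing are indeed caused by different rounding. They do not arise because rgb24 and gbrp are fundamentally different (they are different layouts of the same basic data type), but because the implementations were written by different people at different times for different use cases.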

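yuv420p-to-rgb24 (and vice versa) are very, very old implementations that existed before swscale became part of FFmpeg. These implementations have MMX (!) optimizations and were tuned for optimal conversion on Pentium machines (!). This is technology from around the mid-90s. The idea here was to convert JPEG and MPEG-1 to/from monitor-compatible output before YUV output was a thing. The MMX optimizations were actually well-tuned for their time.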

You can imagine that speed was critical here (at the time, YUV-to-rgb24 conversion was slow and a major component of the overall display pipeline). YUV-to-RGB is a simple matrix multiplication (with coefficients depending on what the exact YUV colorspace is). However, the resolution of the UV planes is different from that of the Y and RGB planes. In the simple (non-exact) yuv-to-rgb24 conversion, the UV planes are upsampled using nearest-neighbour conversion, so each RGB[x,y] uses Y[x,y] and UV[x/2,y/2] as input; in other words, each UV input sample is re-used for a 2x2 block of output RGB pixels. The flag full_chroma_int "undoes" this optimization/shortcut. This means the chroma plane is upsampled using actual scaling conversions before the YUV-to-RGB conversion is initiated, and this upsampling can use filters such as bilinear, bicubic or even more advanced/expensive kernels (e.g. lanczos, sinc or spline).
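The two chroma upsampling strategies can be illustrated with NumPy. This is only a sketch of the idea, not swscale's actual code: it ignores chroma siting, and the linear variant uses plain np.interp with aligned end points rather than swscale's filter kernels. Both function names are hypothetical.

```python
import numpy as np


def upsample_nearest(plane):
    """2x upsampling where every chroma sample is re-used for a 2x2 block."""
    return np.repeat(np.repeat(plane, 2, axis=0), 2, axis=1)


def upsample_linear(plane):
    """2x upsampling by linear interpolation along each axis (simplified, no chroma siting)."""
    h, w = plane.shape
    # Output sample positions expressed in input-sample coordinates
    xs = np.linspace(0, w - 1, 2 * w)
    ys = np.linspace(0, h - 1, 2 * h)
    # Interpolate along the rows first, then along the columns
    rows = np.array([np.interp(xs, np.arange(w), row) for row in plane.astype(float)])
    return np.array([np.interp(ys, np.arange(h), col) for col in rows.T]).T


# A tiny 2x2 chroma plane with a horizontal step
chroma = np.array([[100, 140], [100, 140]], dtype=np.uint8)
nearest = upsample_nearest(chroma)
linear = upsample_linear(chroma)
print(nearest[0])
print(linear[0])
```

Near a chroma edge the two methods feed different U and V values into the same matrix multiplication, which is one way two outputs produced from the same frame can disagree.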

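bitexact is a generic term in FFmpeg for disabling SIMD optimizations that do not produce exactly the same output as the C functions. I will ignore it for now, beyond stating what it means.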

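Finally, accurate_rnd: if I remember correctly, the idea here is that in the matrix multiplication (regardless of whether you use chroma plane upsampling), the typical way to do the integer equivalent of the floating-point r = v*coef1 + y at a given precision (e.g. using 15-bit coefficients) is r = y + ((v*coef1 + 0x4000) >> 15). However, in x86 SIMD, this requires the instruction pmulhrsw, which is only available in SSSE3, not in MMX. Also, this means that for g = u*coef2 + v*coef3 + y you need pmaddwd followed by a round/shift in separate instructions. So the MMX SIMD uses pmulhw instead (an unrounded version of pmulhrsw), which basically makes it r = y + (v*coef1 >> 16) (using 16-bit coefficients). This is mathematically very close but less precise, especially for the G pixel (since it turns g = ((u*coef2 + v*coef3 + 0x8000) >> 16) + y into g = (u*coef2 >> 16) + (v*coef3 >> 16) + y). accurate_rnd "undoes" this optimization/shortcut.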
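The effect of the two rounding strategies on the green channel can be reproduced with plain integer arithmetic. The sketch below is an illustration under stated assumptions, not swscale's code: it uses 16-bit coefficients for both variants, approximate limited-range BT.709 green coefficients (-0.213 and -0.533) chosen for illustration, and chroma values already centred on zero.

```python
import numpy as np

SHIFT = 16
# Illustrative fixed-point versions of the chroma-to-green coefficients (assumed values)
C2 = int(round(-0.213 * (1 << SHIFT)))
C3 = int(round(-0.533 * (1 << SHIFT)))


def g_offset_rounded(u, v):
    """Sum both products first, then round once: ((u*c2 + v*c3 + half) >> shift)."""
    return (u * C2 + v * C3 + (1 << (SHIFT - 1))) >> SHIFT


def g_offset_truncated(u, v):
    """Shift each product separately without rounding: (u*c2 >> shift) + (v*c3 >> shift)."""
    return ((u * C2) >> SHIFT) + ((v * C3) >> SHIFT)


# All combinations of centred 8-bit chroma values
u, v = np.meshgrid(np.arange(-128, 128, dtype=np.int64),
                   np.arange(-128, 128, dtype=np.int64))
exact = -0.213 * u - 0.533 * v

err_rounded = g_offset_rounded(u, v) - exact
err_truncated = g_offset_truncated(u, v) - exact
print("mean error, single rounding:    ", err_rounded.mean())
print("mean error, separate truncation:", err_truncated.mean())
```

With a single rounding step the mean error stays close to zero, whereas truncating each product separately shifts the result downwards by roughly one code value on average, which is the kind of uniform offset described in the question.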

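Now, YUV-to-gbrp. GBR-planar was added for H264 RGB support, since H264 codes RGB as "just another" YUV variant, but with G in the Y plane, and so on. You can imagine that speed was no longer as much of a concern, and neither was MMX support. So here the math is done correctly. In fact, if I remember correctly, accurate_rnd was only added afterwards, so that YUV-to-rgb24 could output the same pixels as YUV-to-gbrp and make the two outputs equivalent, at the cost of not being able to use the (old) MMX optimizations inherited when swscale was merged into FFmpeg. By default, this path upsamples correctly using the user-configured scaling kernel, since the planar conversion is only performed when all YUV planes have the same size; that is, it strictly does only the matrix multiplication. This was added around 2015 or so, so we are talking about an eternity in computer programming terms.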

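Nowadays, the performance gain from "inexact" implementations such as YUV-to-rgb24 is not considered worth the actual quality loss caused by inexact rounding and the lack of configurable chroma-plane scaling. That is why most people will recommend that you use -sws_flags accurate_rnd+full_chroma_int. Also, there are now x86 SIMD (SSSE3 and AVX2) implementations of the "slower" conversion paths, whereas around 2010 this was all straight C code that nobody wanted to spend time optimizing. My guess is that -sws_flags accurate_rnd+full_chroma_int performs slightly worse than the "fast" YUV-to-rgb24 conversion, since it does the chroma upsampling and the matrix multiplication in two steps rather than one. But on modern x86 hardware this performance loss should be minimal and acceptable, unless you are really seriously resource-constrained.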

Hope that all makes sense.
