Why do sliced threads affect real-time encoding with ffmpeg x264 so much?
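I'm using ffmpeg with libx264 to encode a 720p screen captured from X11 in real time at 30 fps.
When I use the -tune zerolatency parameter, the average encoding time per frame can be as high as 12 ms with the baseline profile.
After studying the ffmpeg and x264 source code, I found that the key parameter behind this long encoding time is sliced-threads, which is enabled by -tune zerolatency. After disabling it with -x264-params sliced-threads=0, the encoding time can be as low as 2 ms.
With sliced threads disabled, CPU usage is 40%, compared with only 20% when they are enabled.
Can someone explain the details of sliced threading, especially in real-time encoding? (Assume no frames are buffered for encoding; a frame is encoded only when it is captured.)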
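The x264 documentation shows that frame-based threading has better throughput than slice-based threading. It also notes that the latter does not scale well because parts of the encoder are serial.
Speedup vs. encoding threads for the veryfast preset (non-realtime):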
threads      speedup            psnr
          slice   frame    slice    frame
x264 --preset veryfast --tune psnr --crf 30
 1:       1.00x   1.00x   +0.000   +0.000
 2:       1.41x   2.29x   -0.005   -0.002
 3:       1.70x   3.65x   -0.035   +0.000
 4:       1.96x   3.97x   -0.029   -0.001
 5:       2.10x   3.98x   -0.047   -0.002
 6:       2.29x   3.97x   -0.060   +0.001
 7:       2.36x   3.98x   -0.057   -0.001
 8:       2.43x   3.98x   -0.067   -0.001
 9:               3.96x            +0.000
10:               3.99x            +0.000
11:               4.00x            +0.001
12:               4.00x            +0.001
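One way to read the slice column is to ask how much of the work behaves as if it were serial. The sketch below applies the Karp-Flatt metric to the slice speedups from the table. It is only an illustration: the metric lumps synchronization overhead and load imbalance together with truly serial code, and the function name is hypothetical.

```python
# Slice-threading speedups copied from the table (threads -> speedup).
slice_speedup = {2: 1.41, 3: 1.70, 4: 1.96, 5: 2.10, 6: 2.29, 7: 2.36, 8: 2.43}


def serial_fraction(threads: int, speedup: float) -> float:
    """Karp-Flatt metric: the serial fraction implied by a measured speedup."""
    if threads < 2 or speedup <= 0:
        raise ValueError("need threads >= 2 and speedup > 0")
    return (1.0 / speedup - 1.0 / threads) / (1.0 - 1.0 / threads)


for n, s in slice_speedup.items():
    print(f"{n} threads: implied serial fraction ~ {serial_fraction(n, s):.2f}")
```

For these numbers the implied serial fraction stays at roughly 0.3 to 0.4 across thread counts, which is consistent with the claim that slice threading is limited by serial parts of the encoder.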
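The main difference seems to be that frame threading adds frame latency, because it needs several different frames to work on, whereas with slice-based threading all threads work on the same frame. In real-time encoding, unlike offline encoding, the encoder has to wait for more frames to arrive before the pipeline is full.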
Normal threading, also known as frame-based threading, uses a clever staggered-frame system for parallelism. But it comes at a cost: as mentioned earlier, every extra thread requires one more frame of latency. Slice-based threading has no such issue: every frame is split into slices, each slice encoded on one core, and then the result slapped together to make the final frame. Its maximum efficiency is much lower for a variety of reasons, but it allows at least some parallelism without an increase in latency.
Source: Diary of an x264 Developer
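To put the quoted rule into numbers, here is a minimal sketch that takes "every extra thread requires one more frame of latency" literally. The function name is hypothetical, and the model ignores other sources of delay such as lookahead or B-frames.

```python
def added_latency_ms(threads: int, fps: float, sliced: bool) -> float:
    """Extra latency caused by the threading model, in milliseconds.

    Assumption taken from the quote: frame threading costs one frame of
    latency per extra thread, slice threading costs none.
    """
    if threads < 1 or fps <= 0:
        raise ValueError("need threads >= 1 and fps > 0")
    extra_frames = 0 if sliced else threads - 1
    return extra_frames * 1000.0 / fps


# 30 fps capture, as in the question.
for n in (1, 2, 4, 8):
    print(n, "threads:",
          f"frame-based +{added_latency_ms(n, 30, sliced=False):.1f} ms,",
          f"slice-based +{added_latency_ms(n, 30, sliced=True):.1f} ms")
```

At 30 fps each frame of delay is about 33 ms, so under this model four frame threads already add about 100 ms before the first encoded frame can leave the encoder.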
Sliceless threading: example with 2 threads.
Start encoding frame #0. When it's half done, start encoding frame #1. Thread #1 now only has access to the top half of its reference frame, since the rest hasn't been encoded yet. So it has to restrict the motion search range. But that's probably ok (unless you use lots of threads on a small frame), since it's pretty rare to have such long vertical motion vectors. After a little while, both threads have encoded one row of macroblocks, so thread #1 still gets to use motion range = +/- 1/2 frame height. Later yet, thread #0 finishes frame #0, and moves on to frame #2. Thread #0 now gets motion restrictions, and thread #1 is unrestricted.
Source: http://web.archive.org/web/20150307123140/http://akuvian.org/src/x264/sliceless_threads.txt
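The motion-range restriction described above can be approximated with a small model. This is a simplified sketch, not x264's actual logic: it assumes each thread starts its frame once the previous frame is 1/threads done and then stays that many macroblock rows behind its reference, and it ignores the extra margin a real encoder needs for deblocking and sub-pixel interpolation. The function name is hypothetical.

```python
MB_SIZE = 16  # H.264 macroblock height in pixels


def downward_search_limit_px(frame_height: int, threads: int) -> int:
    """Approximate downward motion range under staggered frame threading."""
    if frame_height <= 0 or threads < 1:
        raise ValueError("need frame_height > 0 and threads >= 1")
    mb_rows = -(-frame_height // MB_SIZE)  # ceiling division
    lag_rows = mb_rows // threads          # rows the reference is ahead
    return lag_rows * MB_SIZE


for n in (2, 4, 12):
    print(f"720p, {n} threads: ~{downward_search_limit_px(720, n)} px downward")
```

For a 720p frame (45 macroblock rows) this gives about 352 pixels with 2 threads and 176 pixels with 4, which matches the remark that the restriction only starts to matter with many threads on a small frame.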
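So it makes sense for -tune zerolatency to enable sliced-threads, because you need to send each frame out as soon as possible rather than encode it efficiently (in terms of performance and quality).
Conversely, using too many threads can hurt performance, because the overhead of maintaining them may exceed the potential gains.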
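To compare the two modes from the question side by side, the sketch below only builds the two ffmpeg command lines and prints them; it does not run anything. The display name :0.0, the 10-second duration, the explicit yuv420p pixel format (the baseline profile does not accept 4:4:4 input) and the helper name are assumptions, not part of the original question.

```python
import shlex
from typing import List


def x11_encode_cmd(sliced_threads: bool) -> List[str]:
    """Build an ffmpeg command that encodes an X11 capture and discards the output."""
    cmd = [
        "ffmpeg", "-benchmark",
        "-f", "x11grab", "-framerate", "30", "-video_size", "1280x720",
        "-i", ":0.0",                      # assumed X11 display
        "-c:v", "libx264", "-profile:v", "baseline",
        "-pix_fmt", "yuv420p",             # assumed; baseline needs 4:2:0
        "-tune", "zerolatency",            # enables sliced-threads
    ]
    if not sliced_threads:
        # Override the setting made by -tune zerolatency.
        cmd += ["-x264-params", "sliced-threads=0"]
    cmd += ["-t", "10", "-f", "null", "-"]  # encode for 10 s, write nothing
    return cmd


for sliced in (True, False):
    print(shlex.join(x11_encode_cmd(sliced)))
```

Keep in mind that per-frame encode time measured at the API boundary and end-to-end latency are different quantities, so the two runs are best compared on both.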