Alpha Blending using OpenCV for videos

Question

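I want to blend one video on top of another using an alpha video. This is my code. It works perfectly, but the problem is that it is not efficient at all, and that is because of the /255 parts. It is slow and lags.

Is there a standard, efficient way of doing this? I want the results to be real-time. Thanks.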

import cv2
import numpy as np

def main():
    foreground = cv2.VideoCapture('circle.mp4')
    background = cv2.VideoCapture('video.MP4')
    alpha = cv2.VideoCapture('circle_alpha.mp4')

    while foreground.isOpened():
        fr_foreground = foreground.read()[1]/255
        fr_background = background.read()[1]/255     
        fr_alpha = alpha.read()[1]/255

        cv2.imshow('My Image',cmb(fr_foreground,fr_background,fr_alpha))

        if cv2.waitKey(1) == ord('q'): break

    cv2.destroyAllWindows

def cmb(fg,bg,a):
    return fg * a + bg * (1-a)

if __name__ == '__main__':
    main()

Answer 1

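Let's get a few obvious problems out of the way first. foreground.isOpened() keeps returning true even after you have reached the end of the video, so your program will end up crashing at that point. The solution is two-fold. First, test all 3 VideoCapture instances right after you create them, using something like: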

if not foreground.isOpened() or not background.isOpened() or not alpha.isOpened():
    print("Unable to open input videos.")
    return

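That makes sure all of them opened properly. The next part is to handle reaching the end of the videos correctly. That means either checking the first of the two return values of read(), which is a boolean flag representing success, or testing whether the frame is None.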

while True:
    r_fg, fr_foreground = foreground.read()
    r_bg, fr_background = background.read()
    r_a, fr_alpha = alpha.read()
    if not r_fg or not r_bg or not r_a:
        break # End of video

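Additionally, it seems that you never actually call cv2.destroyAllWindows(): the () is missing. Not that it really matters.

To help investigate and optimize this, I added some detailed timing, using the timeit module and a couple of convenience functions: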

from timeit import default_timer as timer

def update_times(times, total_times):
    for i in range(len(times) - 1):
        total_times[i] += (times[i+1]-times[i]) * 1000

def print_times(total_times, n):
    print("Iterations: %d" % n)
    for i in range(len(total_times)):
        print("Step %d: %0.4f ms" % (i, total_times[i] / n))
    print("Total: %0.4f ms" % (np.sum(total_times) / n))

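and modified the main() function to measure the time taken by each logical step: read, scale, blend, show, waitKey. To do this I split the division into separate statements. I also made a slight modification that makes this work in Python 2.x as well (there, /255 is interpreted as integer division and yields wrong results).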

times = [0.0] * 6
total_times = [0.0] * (len(times) - 1)
n = 0
while True:
    times[0] = timer()
    r_fg, fr_foreground = foreground.read()
    r_bg, fr_background = background.read()
    r_a, fr_alpha = alpha.read()
    if not r_fg or not r_bg or not r_a:
        break # End of video
    times[1] = timer()
    fr_foreground = fr_foreground / 255.0
    fr_background = fr_background / 255.0
    fr_alpha = fr_alpha / 255.0
    times[2] = timer()
    result = cmb(fr_foreground,fr_background,fr_alpha)
    times[3] = timer()
    cv2.imshow('My Image', result)
    times[4] = timer()
    if cv2.waitKey(1) == ord('q'): break
    times[5] = timer()
    update_times(times, total_times)
    n += 1

print_times(total_times, n)

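When I run this with 1280x800 mp4 videos as input, I notice that it really is rather sluggish, and that it only uses 15% CPU on my 6-core machine. The timings of the individual sections are as follows: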

Iterations: 1190
Step 0: 11.4385 ms
Step 1: 37.1320 ms
Step 2: 39.4083 ms
Step 3: 2.5488 ms
Step 4: 10.7083 ms
Total: 101.2358 ms

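This suggests that the biggest bottlenecks are the scaling step and the blending step. The low CPU usage is also suboptimal, but let's focus on the low-hanging fruit first.

Let's look at the data types of the numpy arrays we use. read() gives us arrays with a dtype of np.uint8: 8-bit unsigned integers. However, the floating-point division (as written) will yield an array with a dtype of np.float64: 64-bit floating-point values. We don't really need this level of precision for our algorithm, so we'd be better off using only 32-bit floats. That would mean that if any of the operations are vectorized, we can potentially do twice as many calculations in the same amount of time.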
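The dtype claims above are easy to check in isolation. This is a minimal sketch that uses a synthetic all-zero array in place of a decoded frame (the 1280x800 BGR shape is an assumption taken from the test videos mentioned earlier):

```python
import numpy as np

# Synthetic stand-in for a frame returned by VideoCapture.read()
# (assumption: 1280x800, 3 channels, dtype uint8).
frame = np.zeros((800, 1280, 3), dtype=np.uint8)

as_f64 = frame / 255.0               # Python float divisor -> float64 result
as_f32 = frame / np.float32(255.0)   # float32 divisor -> float32 result

print(frame.dtype, frame.nbytes)     # uint8 3072000
print(as_f64.dtype, as_f64.nbytes)   # float64 24576000
print(as_f32.dtype, as_f32.nbytes)   # float32 12288000
```

The float64 result is eight times the size of the original frame, the float32 one four times, so every later step has half as much memory to move.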

There are two options here. We could simply cast the divisor to np.float32, which will cause numpy to give us a result with the same dtype:

fr_foreground = fr_foreground / np.float32(255.0)
fr_background = fr_background / np.float32(255.0)
fr_alpha = fr_alpha / np.float32(255.0)

This gives us the following timings:

Iterations: 1786
Step 0: 9.2550 ms
Step 1: 19.0144 ms
Step 2: 21.2120 ms
Step 3: 1.4662 ms
Step 4: 10.8889 ms
Total: 61.8365 ms

Or we could first cast the arrays to np.float32 and then do the scaling in place.

fr_foreground = np.float32(fr_foreground)
fr_background = np.float32(fr_background)
fr_alpha = np.float32(fr_alpha)

fr_foreground /= 255.0
fr_background /= 255.0
fr_alpha /= 255.0

Which gives the following timings (step 1 is split into conversion (1) and scaling (2); the rest shift by 1):

Iterations: 1786
Step 0: 9.0589 ms
Step 1: 13.9614 ms
Step 2: 4.5960 ms
Step 3: 20.9279 ms
Step 4: 1.4631 ms
Step 5: 10.4396 ms
Total: 60.4469 ms

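Both are roughly the same, running at about 60% of the original time. I'll stick with the second option, since it will become useful in the later steps. Let's see what else we can improve.

From the previous timings, we can see that the scaling is no longer the bottleneck, but an idea still comes to mind: division is generally slower than multiplication, so what if we multiplied by a reciprocal?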

fr_foreground *= 1/255.0
fr_background *= 1/255.0
fr_alpha *= 1/255.0

Indeed, this gains us about a millisecond. Nothing spectacular, but it was easy, so we might as well go with it:

Iterations: 1786
Step 0: 9.1843 ms
Step 1: 14.2349 ms
Step 2: 3.5752 ms
Step 3: 21.0545 ms
Step 4: 1.4692 ms
Step 5: 10.6917 ms
Total: 60.2097 ms
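A quick sanity check that the reciprocal trick does not change the result in any way that matters. This sketch uses a small random array instead of a video frame (an assumption made for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
raw = rng.integers(0, 256, size=(4, 4, 3), dtype=np.uint8)

divided = np.float32(raw)
divided /= 255.0             # in-place division

multiplied = np.float32(raw)
multiplied *= 1 / 255.0      # in-place multiplication by the reciprocal

# The two may differ in the last bit of the float32 mantissa, which is
# far below anything that survives the conversion back to 8 bits.
print(np.abs(divided - multiplied).max())
```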

Now the blending function is the biggest bottleneck, followed by the typecast of all 3 arrays. If we look at what the blending operation does:

foreground * alpha + background * (1.0 - alpha)

we can observe that, for the math to work, the only value that needs to be in the range (0.0, 1.0) is alpha.
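To see why, note that the blend is linear in the foreground and background, so scaling both by 1/255 just scales the result by 1/255. A small numerical check, using random values as stand-ins for real frames (an assumption):

```python
import numpy as np

rng = np.random.default_rng(1)
fg = rng.integers(0, 256, size=(2, 2, 3)).astype(np.float64)
bg = rng.integers(0, 256, size=(2, 2, 3)).astype(np.float64)
a = rng.random((2, 2, 3))    # alpha, already in the range 0.0 to 1.0

# Scale everything to 0..1 and blend: the result is in 0..1.
normalized = (fg / 255) * a + (bg / 255) * (1 - a)

# Leave fg and bg in 0..255 and blend: the result is in 0..255.
unscaled = fg * a + bg * (1 - a)

print(np.allclose(unscaled, normalized * 255))
```

The unscaled result is already in the range that imshow expects for 8-bit images, so it only needs a cast back to np.uint8.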

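What if we scale only the alpha image? Also, since multiplication by a floating-point value promotes to floating point, what if we also skip the type conversion? This means that cmb() has to return an np.uint8 array: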

def cmb(fg,bg,a):
    return np.uint8(fg * a + bg * (1-a))

and in the loop we would have

    #fr_foreground = np.float32(fr_foreground)
    #fr_background = np.float32(fr_background)
    fr_alpha = np.float32(fr_alpha)

    #fr_foreground *= 1/255.0
    #fr_background *= 1/255.0
    fr_alpha *= 1/255.0

The timings with this are

Step 0: 7.7023 ms
Step 1: 4.6758 ms
Step 2: 1.1061 ms
Step 3: 27.3188 ms
Step 4: 0.4783 ms
Step 5: 9.0027 ms
Total: 50.2840 ms

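Obviously, steps 1 and 2 are much faster, since we only do 1/3 of the work. imshow also speeds up, since it doesn't have to convert from floating point. Inexplicably, the reads also got faster (I guess we avoid some under-the-hood reallocations, since fr_foreground and fr_background always contain the pristine frame). We do pay the price of an additional cast in cmb(), but overall this seems like a win: we are at 50% of the original time.

To continue, let's get rid of the cmb() function, move its functionality into main() and split it up to measure the cost of each of the operations. Let's also try to reuse the result of alpha.read() (since we recently saw that improvement in read() performance):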

times = [0.0] * 11
total_times = [0.0] * (len(times) - 1)
n = 0
while True:
    times[0] = timer()
    r_fg, fr_foreground = foreground.read()
    r_bg, fr_background = background.read()
    r_a, fr_alpha_raw = alpha.read()
    if not r_fg or not r_bg or not r_a:
        break # End of video

    times[1] = timer()
    fr_alpha = np.float32(fr_alpha_raw)
    times[2] = timer()
    fr_alpha *= 1/255.0
    times[3] = timer()
    fr_alpha_inv = 1.0 - fr_alpha
    times[4] = timer()
    fr_fg_weighed = fr_foreground * fr_alpha
    times[5] = timer()
    fr_bg_weighed = fr_background * fr_alpha_inv
    times[6] = timer()
    sum = fr_fg_weighed + fr_bg_weighed
    times[7] = timer()
    result = np.uint8(sum)
    times[8] = timer()
    cv2.imshow('My Image', result)
    times[9] = timer()
    if cv2.waitKey(1) == ord('q'): break
    times[10] = timer()
    update_times(times, total_times)
    n += 1

New timings:

Iterations: 1786
Step 0: 6.8733 ms
Step 1: 5.2742 ms
Step 2: 1.1430 ms
Step 3: 4.5800 ms
Step 4: 7.0372 ms
Step 5: 7.0675 ms
Step 6: 5.3082 ms
Step 7: 2.6912 ms
Step 8: 0.4658 ms
Step 9: 9.6966 ms
Total: 50.1372 ms

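We didn't really gain anything, but the reads got noticeably faster.

This leads to another idea: what if we try to minimize allocations and reuse the arrays in subsequent iterations?

We can pre-allocate the necessary arrays in the first iteration (using numpy.zeros_like), after we read the first set of frames: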

if n == 0: # Pre-allocate
    fr_alpha = np.zeros_like(fr_alpha_raw, np.float32)
    fr_alpha_inv = np.zeros_like(fr_alpha_raw, np.float32)
    fr_fg_weighed = np.zeros_like(fr_alpha_raw, np.float32)
    fr_bg_weighed = np.zeros_like(fr_alpha_raw, np.float32)
    sum = np.zeros_like(fr_alpha_raw, np.float32)
    result = np.zeros_like(fr_alpha_raw, np.uint8)

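Now, we can use

numpy.add for adding the two images together
numpy.subtract for the subtraction
numpy.multiply for the multiplications
numpy.copyto for the type conversion

We can also merge steps 1 and 2 together, using a single numpy.multiply.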

times = [0.0] * 10
total_times = [0.0] * (len(times) - 1)
n = 0
while True:
    times[0] = timer()
    r_fg, fr_foreground = foreground.read()
    r_bg, fr_background = background.read()
    r_a, fr_alpha_raw = alpha.read()
    if not r_fg or not r_bg or not r_a:
        break # End of video

    if n == 0: # Pre-allocate
        fr_alpha = np.zeros_like(fr_alpha_raw, np.float32)
        fr_alpha_inv = np.zeros_like(fr_alpha_raw, np.float32)
        fr_fg_weighed = np.zeros_like(fr_alpha_raw, np.float32)
        fr_bg_weighed = np.zeros_like(fr_alpha_raw, np.float32)
        sum = np.zeros_like(fr_alpha_raw, np.float32)
        result = np.zeros_like(fr_alpha_raw, np.uint8)

    times[1] = timer()
    np.multiply(fr_alpha_raw, np.float32(1/255.0), fr_alpha)
    times[2] = timer()
    np.subtract(1.0, fr_alpha, fr_alpha_inv)
    times[3] = timer()
    np.multiply(fr_foreground, fr_alpha, fr_fg_weighed)
    times[4] = timer()
    np.multiply(fr_background, fr_alpha_inv, fr_bg_weighed)
    times[5] = timer()
    np.add(fr_fg_weighed, fr_bg_weighed, sum)
    times[6] = timer()
    np.copyto(result, sum, 'unsafe')
    times[7] = timer()
    cv2.imshow('My Image', result)
    times[8] = timer()
    if cv2.waitKey(1) == ord('q'): break
    times[9] = timer()
    update_times(times, total_times)
    n += 1

This gives us the following timings:

Iterations: 1786
Step 0: 7.0515 ms
Step 1: 3.8839 ms
Step 2: 1.9080 ms
Step 3: 4.5198 ms
Step 4: 4.3871 ms
Step 5: 2.7576 ms
Step 6: 1.9273 ms
Step 7: 0.4382 ms
Step 8: 7.2340 ms
Total: 34.1074 ms

Significant improvement in all the steps we modified. We are down to about 35% of the time needed by the original implementation.
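For experimenting without video files, the preallocated version can be condensed into one function. This is only a sketch: blend_into and the bufs list are hypothetical names introduced here, and the random arrays stand in for decoded frames.

```python
import numpy as np

def blend_into(fg, bg, alpha_raw, bufs):
    """Alpha-blend uint8 frames, writing only into preallocated buffers."""
    alpha, alpha_inv, fg_w, bg_w, total, result = bufs
    np.multiply(alpha_raw, np.float32(1 / 255.0), out=alpha)  # convert and scale in one go
    np.subtract(1.0, alpha, out=alpha_inv)
    np.multiply(fg, alpha, out=fg_w)
    np.multiply(bg, alpha_inv, out=bg_w)
    np.add(fg_w, bg_w, out=total)
    np.copyto(result, total, casting='unsafe')  # float32 -> uint8, drops the fraction
    return result

shape = (4, 4, 3)
rng = np.random.default_rng(2)
fg, bg, alpha_raw = (rng.integers(0, 256, size=shape, dtype=np.uint8) for _ in range(3))

# Five float32 work buffers plus one uint8 output buffer, allocated once.
bufs = [np.zeros(shape, np.float32) for _ in range(5)] + [np.zeros(shape, np.uint8)]

out = blend_into(fg, bg, alpha_raw, bufs)
print(out is bufs[-1])   # True: the result lives in the preallocated buffer
```

Passing the destination through out= is what avoids a fresh allocation per frame; the positional third argument used in the loop above does the same thing.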

Minor update:

Based on Silencer's answer I also measured cv2.convertScaleAbs. It actually runs a bit faster:

Step 6: 1.2318 ms

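This gave me another idea: we could take advantage of cv2.add, which lets us specify the destination data type and does a saturation cast as well. This would allow us to combine steps 5 and 6 together.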

cv2.add(fr_fg_weighed, fr_bg_weighed, result, dtype=cv2.CV_8UC3)

which comes out at

Step 5: 3.3621 ms

Again a small gain (previously we were at around 3.9 ms).

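Extending from this, cv2.subtract and cv2.multiply are further candidates. We need to use a 4-element tuple to define a scalar (an intricacy of the Python bindings), and we need to explicitly define the output data type for the multiplication.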

    cv2.subtract((1.0, 1.0, 1.0, 0.0), fr_alpha, fr_alpha_inv)
    cv2.multiply(fr_foreground, fr_alpha, fr_fg_weighed, dtype=cv2.CV_32FC3)
    cv2.multiply(fr_background, fr_alpha_inv, fr_bg_weighed, dtype=cv2.CV_32FC3)

Timings:

Step 2: 2.1897 ms
Step 3: 2.8981 ms
Step 4: 2.9066 ms

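This seems to be about as far as we can get without parallelization. We are already taking advantage of whatever OpenCV may provide in terms of individual operations, so we should focus on pipelining our implementation.

To help me figure out how to partition the code between the different pipeline stages (threads), I made a chart that shows all the operations, our best times for them, and the inter-dependencies of the calculations.

WIP: see the comments for additional info while I write this up.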
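Until that is written up, here is one possible shape such a pipeline could take. It is a rough sketch and not the partitioning from the chart: the stage split, the queue sizes and the names reader, blender and run_pipeline are all assumptions, and random arrays stand in for the three VideoCapture.read() calls.

```python
import queue
import threading
import numpy as np

def reader(n_frames, shape, out_q):
    """Stage 1: stand-in for reading foreground, background and alpha frames."""
    rng = np.random.default_rng(0)
    for _ in range(n_frames):
        out_q.put(tuple(rng.integers(0, 256, size=shape, dtype=np.uint8)
                        for _ in range(3)))
    out_q.put(None)  # sentinel: end of video

def blender(in_q, out_q):
    """Stage 2: alpha-blend each (foreground, background, alpha) triple."""
    while True:
        item = in_q.get()
        if item is None:
            out_q.put(None)  # pass the sentinel on to the next stage
            break
        fg, bg, alpha_raw = item
        alpha = alpha_raw * np.float32(1 / 255.0)
        out_q.put(np.uint8(fg * alpha + bg * (1.0 - alpha)))

def run_pipeline(n_frames=5, shape=(8, 8, 3)):
    # Bounded queues keep a fast stage from running far ahead of a slow one.
    q_frames = queue.Queue(maxsize=4)
    q_blended = queue.Queue(maxsize=4)
    threads = [
        threading.Thread(target=reader, args=(n_frames, shape, q_frames)),
        threading.Thread(target=blender, args=(q_frames, q_blended)),
    ]
    for t in threads:
        t.start()
    results = []
    while True:  # Stage 3: stand-in for imshow / waitKey on the main thread
        frame = q_blended.get()
        if frame is None:
            break
        results.append(frame)
    for t in threads:
        t.join()
    return results

blended = run_pipeline()
print(len(blended), blended[0].shape, blended[0].dtype)
```

Threads are a reasonable fit here because NumPy and OpenCV typically release the GIL while they do the heavy array work, so the stages can overlap. Whether that pays off for a given video size would have to be measured.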

Answer 2

I'm using OpenCV 4.00-pre and Python 3.6.

1. There is no need to do three xxx/255 operations. Scaling just alpha is enough.

2. Take care with the type conversion: prefer cv2.convertScaleAbs(xxx) over np.uint8(xxx) or np.copyto(xxx, yyy, "unsafe").

3. Preallocating the memory should be better.

I use #2, cv2.convertScaleAbs, to avoid underflow/overflow and keep the values in the range [0, 255]. For example:

>>> x = np.array([[-1,256]])
>>> y = np.uint8(x)
>>> z = cv2.convertScaleAbs(x)
>>> x
array([[ -1, 256]])
>>> y
array([[255,   0]], dtype=uint8)
>>> z
array([[  1, 255]], dtype=uint8)
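Note in the output above that -1 became 1, not 0: cv2.convertScaleAbs takes the absolute value before saturating. The NumPy sketch below emulates that behaviour (with the default scale and offset) and compares it with plain clamping. It is an illustration only, not a replacement for the OpenCV call:

```python
import numpy as np

x = np.array([[-1, 256]])

# Absolute value first, then saturation to 0..255 (what convertScaleAbs
# does with its default scale of 1 and offset of 0).
abs_then_saturate = np.minimum(np.abs(x), 255).astype(np.uint8)

# Clamping to the valid range instead, which maps negative values to 0.
clamped = np.clip(x, 0, 255).astype(np.uint8)

print(abs_then_saturate)   # [[  1 255]]
print(clamped)             # [[  0 255]]
```

For alpha blending with alpha in 0..1 the blended values are never negative, so the difference does not show up in practice.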

##! 2018/05/09 13:54:34

import cv2
import numpy as np
import time

def cmb(fg,bg,a):
    return fg * a + bg * (1-a)

def test2():
    cap = cv2.VideoCapture(0)
    ret, prev_frame = cap.read()
    """
    foreground = cv2.VideoCapture('circle.mp4')
    background = cv2.VideoCapture('video.MP4')
    alphavideo = cv2.VideoCapture('circle_alpha.mp4')
    """
    while cap.isOpened():
        ts = time.time()
        ret, fg = cap.read()
        if not ret:
            break  # no more frames from the capture device
        alpha = fg.copy()
        bg = prev_frame
        """
        ret, fg = foreground.read()
        ret, bg = background.read()
        ret, alpha = alphavideo.read()
        """

        alpha = np.multiply(alpha, 1.0/255)
        blended = cv2.convertScaleAbs(cmb(fg, bg, alpha))
        te = time.time()
        dt = te-ts
        fps = 1/dt
        print("{:.3}ms, {:.3} fps".format(1000*dt, fps))
        cv2.imshow('Blended', blended)

        if cv2.waitKey(1) == ord('q'):
            break

    cv2.destroyAllWindows()

if __name__ == "__main__":
    test2()

Some output like this:

39.0ms, 25.6 fps
37.0ms, 27.0 fps
38.0ms, 26.3 fps
37.0ms, 27.0 fps
38.0ms, 26.3 fps
37.0ms, 27.0 fps
38.0ms, 26.3 fps
37.0ms, 27.0 fps
37.0ms, 27.0 fps
37.0ms, 27.0 fps
37.0ms, 27.0 fps
38.0ms, 26.3 fps
37.0ms, 27.0 fps
37.0ms, 27.0 fps
37.0ms, 27.0 fps
37.0ms, 27.0 fps
...

Answer 3

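If it is simply blend, render and forget, then it makes sense to do it on the GPU. Among many other things, VTK (The Visualization Toolkit, https://www.vtk.org) can do this for you instead of imshow. VTK is already known from the OpenCV 3D Visualizer module (https://docs.opencv.org/3.2.0/d1/d19/group__viz.html), so it should not add many more dependencies.

After that, the whole calculation part (excluding reading of the video frames) boils down to cv2.mixChannels and passing the pixel data to two renderers, which on my computer takes about 5 ms per iteration for a 1280x720 video.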

import sys
import cv2
import numpy as np
import vtk
from vtk.util import numpy_support
import time

class Renderer:
    # VTK renderer with two layers
    def __init__( self ):
        self.layer1 = vtk.vtkRenderer()
        self.layer1.SetLayer(0)
        self.layer2 = vtk.vtkRenderer()
        self.layer2.SetLayer(1)
        self.renWin = vtk.vtkRenderWindow()
        self.renWin.SetNumberOfLayers( 2 )
        self.renWin.AddRenderer(self.layer1)
        self.renWin.AddRenderer(self.layer2)
        self.iren = vtk.vtkRenderWindowInteractor()
        self.iren.SetRenderWindow(self.renWin)
        self.iren.Initialize()      
    def Render( self ):
        self.iren.Render()

# set background image to a given renderer (resets the camera)
# from https://www.vtk.org/Wiki/VTK/Examples/Cxx/Images/BackgroundImage
def SetBackground( ren, image ):    
    bits = numpy_support.numpy_to_vtk( image.ravel() )
    bits.SetNumberOfComponents( image.shape[2] )
    bits.SetNumberOfTuples( bits.GetNumberOfTuples()//bits.GetNumberOfComponents() )

    img = vtk.vtkImageData()
    img.GetPointData().SetScalars( bits );
    img.SetExtent( 0, image.shape[1]-1, 0, image.shape[0]-1, 0,0 );
    origin = img.GetOrigin()
    spacing = img.GetSpacing()
    extent = img.GetExtent()

    actor = vtk.vtkImageActor()
    actor.SetInputData( img )

    ren.RemoveAllViewProps()
    ren.AddActor( actor )
    camera = vtk.vtkCamera()
    camera.ParallelProjectionOn()
    xc = origin[0] + 0.5*(extent[0] + extent[1])*spacing[0]
    yc = origin[1] + 0.5*(extent[2] + extent[3])*spacing[1]
    yd = (extent[3] - extent[2] + 1)*spacing[1]
    d = camera.GetDistance()
    camera.SetParallelScale(0.5*yd)
    camera.SetFocalPoint(xc,yc,0.0)
    camera.SetPosition(xc,yc,-d)
    camera.SetViewUp(0,-1,0)
    ren.SetActiveCamera( camera )
    return img

# update the scalar data without bounds check
def UpdateImageData( vtkimage, image ):
    bits = numpy_support.numpy_to_vtk( image.ravel() )
    bits.SetNumberOfComponents( image.shape[2] )
    bits.SetNumberOfTuples( bits.GetNumberOfTuples()//bits.GetNumberOfComponents() )
    vtkimage.GetPointData().SetScalars( bits );

r = Renderer()
r.renWin.SetSize(1280,720)
cap = cv2.VideoCapture('video.mp4')
image = cv2.imread('hello.png',1)
alpha = cv2.cvtColor(image,cv2.COLOR_RGB2GRAY )
ret, alpha = cv2.threshold( alpha, 127, 127, cv2.THRESH_BINARY )
alpha = np.reshape( alpha, (alpha.shape[0],alpha.shape[1], 1 ) )

src1 = None
src2 = None
overlay = None
c = 0
while True:
    # read the data
    ret, mat = cap.read()
    if not ret:
        break
    #TODO ret, image = cap2.read() #(rgb)
    #TODO ret, alpha = cap3.read() #(mono)

    # alpha blend
    t = time.time()
    if overlay is None:
        # allocate the 4-channel (colour + alpha) overlay once and reuse it
        overlay = np.zeros( [image.shape[0],image.shape[1],4], np.uint8 )
    cv2.mixChannels( [image, alpha], [overlay], [0,0,1,1,2,2,3,3] )
    if src1 is None:
        src1 = SetBackground( r.layer1, mat )
    else:
        UpdateImageData( src1, mat )
    if src2 is None:
        src2 = SetBackground( r.layer2, overlay )
    else:
        UpdateImageData( src2, overlay )
    r.Render()
    # blending done
    t = time.time() - t

    if c % 10 == 0:
        print(1000*t)  # elapsed milliseconds, printed every 10th frame
    c = c + 1
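The cv2.mixChannels call in the loop packs the 3-channel image and the 1-channel alpha mask into one 4-channel overlay. Input channels are numbered consecutively across the source arrays, so in [0,0,1,1,2,2,3,3] channels 0 to 2 come from image and channel 3 is alpha. The NumPy sketch below builds the same layout on tiny made-up arrays. It allocates a new array each time, which is exactly what the preallocated overlay in the answer avoids:

```python
import numpy as np

image = np.arange(2 * 2 * 3, dtype=np.uint8).reshape(2, 2, 3)   # 3-channel colour image
alpha = np.full((2, 2, 1), 127, dtype=np.uint8)                  # 1-channel mask

# Stack along the channel axis: colour channels first, alpha last.
overlay = np.dstack([image, alpha])

print(overlay.shape)    # (2, 2, 4)
print(overlay[0, 0])    # [  0   1   2 127]
```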
