删除 for 循环以加快执行速度 - 矢量化
Remove for loops for faster execution - vectorize
作为我学术项目的一部分,我正在研究图像的线性滤波器。下面是代码,仅使用 NumPy(无外部库)并希望通过矢量化或任何其他选项消除 for 循环。如何实现矢量化以加快执行速度?感谢您的帮助。
输入-
- Image.shape - (568, 768)
- weightArray.shape - (3, 3)
def apply_filter(image: np.array, weight_array: np.array) -> np.array:
rows, cols = image.shape
height, width = weight_array.shape
output = np.zeros((rows - height + 1, cols - width + 1))
for rrow in range(rows - height + 1):
for ccolumn in range(cols - width + 1):
for hheight in range(height):
for wwidth in range(width):
imgval = image[rrow + hheight, ccolumn + wwidth]
filterval = weight_array[hheight, wwidth]
output[rrow, ccolumn] += imgval * filterval
return output
向量化是将每个显式 for
循环转换为 1 维 数组运算 的过程。
在 Python 中,这将涉及根据 slices.
重新构想您的数据
在下面的代码中,我提供了内核循环的工作矢量化。这显示了如何进行矢量化,但由于它仅优化 3x3 数组,因此不会为您提供最大的可用收益。
如果您想看到更大的改进,您将向量化图像数组,我也为您制作了模板——但留下一些作为练习。
import numpy as np
from PIL import Image
## no vectorization
def applyFilterMethod1(image: np.array, weightArray: np.array) -> np.array:
rows, cols = image.shape ; height, width = weightArray.shape
output = np.zeros((rows - height + 1, cols - width + 1))
for rrow in range(rows - height + 1):
for ccolumn in range(cols - width + 1):
for hheight in range(height):
for wwidth in range(width):
imgval = image[rrow + hheight, ccolumn + wwidth]
filterval = weightArray[hheight, wwidth]
output[rrow, ccolumn] += imgval * filterval
return output
## vectorize the kernel loop (~3x improvement)
def applyFilterMethod2(image: np.array, weightArray: np.array) -> np.array:
rows, cols = image.shape ; height, width = weightArray.shape
output = np.zeros((rows - height + 1, cols - width + 1))
for rrow in range(rows - height + 1):
for ccolumn in range(cols - width + 1):
imgval = image[rrow:rrow + height, ccolumn:ccolumn + width]
filterval = weightArray[:, :]
output[rrow, ccolumn] = sum(sum(imgval * filterval))
return output
## vectorize the image loop (~50x improvement)
def applyFilterMethod3(image: np.array, weightArray: np.array) -> np.array:
rows, cols = image.shape ; height, width = weightArray.shape
output = np.zeros((rows - height + 1, cols - width + 1))
for hheight in range(height):
for wwidth in range(width):
imgval = 0 ## TODO -- construct a compatible slice
filterval = weightArray[hheight, wwidth]
output[:, :] += imgval * filterval
return output
src = Image.open("input.png")
sb = np.asarray(src)
cb = np.array([[1,2,1],[2,4,2],[1,2,1]])
cb = cb/sum(sum(cb)) ## normalize
db = applyFilterMethod2(sb, cb)
dst = Image.fromarray(db)
dst.convert("L").save("output.png")
#src.show() ; dst.show()
注意:您可能会删除所有四个 for
循环,但会增加一些复杂性。但是,因为这只会消除 9 次迭代的开销(在本例中),所以我估计它不会比 applyFilterMethod3
产生任何额外的性能提升。此外,虽然我没有尝试过,但我想象的完成方式可能会增加比它会减少的更多的开销。
仅供参考:这是一个标准的图像卷积(仅支持实现的灰度)。我总是想指出,为了在数学上是正确的,这需要补偿几乎每个默认图像编码中隐含的 gamma compression——但这个小细节经常被忽略。
讨论
这种类型的向量化在 Python 中通常是必需的,具体来说,因为标准 Python 解释器是算法的 extremely inefficient at processing large for
loops. Explicitly iterating over each pixel of an image, therefore, wastes a lot time. Ultimately, though, the vectorized implementation does not change the amount of real work performed, so we're only talking about eliminating an overhead 方面。
然而,矢量化有一个副作用:Parallelization. Lumping a large amount of data processing onto a single operator gives the language/library more flexibility in how to optimize the execution. This might include executing your embarrassingly parallel operation on a GPU -- if you have the right tools, for example the Tensorflow image module。
Python 对 array programming 的无缝支持是它在机器学习中变得非常流行的原因之一,这可能是计算密集型的。
解决方案
这是 imgval
赋值的解决方案,上面留作练习。
imgval = image[hheight:hheight+rows - height+1, wwidth:wwidth+cols - width +1]
您可以构造图像的切片视图数组,每个切片视图移动权重数组的索引,然后将其乘以权重并求和。
def apply_filter(image: np.array, weights: np.array) -> np.array:
height, width = weights.shape
indices = np.indices(weights.shape).T.reshape(weights.size, 2)
views = np.array([image[r:-height+r,c:-width+c] for r, c in indices])
return np.inner(views.T, weights.T.flatten()).T # sum product
(我不得不在几个点上进行转置和重塑以使数据成为所需的形状和顺序。可能有更简单的方法。)
仍然有一个偷偷摸摸的 for
循环,其形式是对权重索引的列表理解,但我们将 for 循环内的操作最小化以创建一组切片视图。使用 sliding_window_view
, but it's not clear if that would improve performance; or stride_tricks.as_strided
(see answers to ).
可能会避免循环
作为我学术项目的一部分,我正在研究图像的线性滤波器。下面是代码,仅使用 NumPy(无外部库)并希望通过矢量化或任何其他选项消除 for 循环。如何实现矢量化以加快执行速度?感谢您的帮助。
输入-
- Image.shape - (568, 768)
- weightArray.shape - (3, 3)
def apply_filter(image: np.array, weight_array: np.array) -> np.array:
rows, cols = image.shape
height, width = weight_array.shape
output = np.zeros((rows - height + 1, cols - width + 1))
for rrow in range(rows - height + 1):
for ccolumn in range(cols - width + 1):
for hheight in range(height):
for wwidth in range(width):
imgval = image[rrow + hheight, ccolumn + wwidth]
filterval = weight_array[hheight, wwidth]
output[rrow, ccolumn] += imgval * filterval
return output
向量化是将每个显式 for
循环转换为 1 维 数组运算 的过程。
在 Python 中,这将涉及根据 slices.
在下面的代码中,我提供了内核循环的工作矢量化。这显示了如何进行矢量化,但由于它仅优化 3x3 数组,因此不会为您提供最大的可用收益。
如果您想看到更大的改进,您将向量化图像数组,我也为您制作了模板——但留下一些作为练习。
import numpy as np
from PIL import Image
## no vectorization
def applyFilterMethod1(image: np.array, weightArray: np.array) -> np.array:
rows, cols = image.shape ; height, width = weightArray.shape
output = np.zeros((rows - height + 1, cols - width + 1))
for rrow in range(rows - height + 1):
for ccolumn in range(cols - width + 1):
for hheight in range(height):
for wwidth in range(width):
imgval = image[rrow + hheight, ccolumn + wwidth]
filterval = weightArray[hheight, wwidth]
output[rrow, ccolumn] += imgval * filterval
return output
## vectorize the kernel loop (~3x improvement)
def applyFilterMethod2(image: np.array, weightArray: np.array) -> np.array:
rows, cols = image.shape ; height, width = weightArray.shape
output = np.zeros((rows - height + 1, cols - width + 1))
for rrow in range(rows - height + 1):
for ccolumn in range(cols - width + 1):
imgval = image[rrow:rrow + height, ccolumn:ccolumn + width]
filterval = weightArray[:, :]
output[rrow, ccolumn] = sum(sum(imgval * filterval))
return output
## vectorize the image loop (~50x improvement)
def applyFilterMethod3(image: np.array, weightArray: np.array) -> np.array:
rows, cols = image.shape ; height, width = weightArray.shape
output = np.zeros((rows - height + 1, cols - width + 1))
for hheight in range(height):
for wwidth in range(width):
imgval = 0 ## TODO -- construct a compatible slice
filterval = weightArray[hheight, wwidth]
output[:, :] += imgval * filterval
return output
src = Image.open("input.png")
sb = np.asarray(src)
cb = np.array([[1,2,1],[2,4,2],[1,2,1]])
cb = cb/sum(sum(cb)) ## normalize
db = applyFilterMethod2(sb, cb)
dst = Image.fromarray(db)
dst.convert("L").save("output.png")
#src.show() ; dst.show()
注意:您可能会删除所有四个 for
循环,但会增加一些复杂性。但是,因为这只会消除 9 次迭代的开销(在本例中),所以我估计它不会比 applyFilterMethod3
产生任何额外的性能提升。此外,虽然我没有尝试过,但我想象的完成方式可能会增加比它会减少的更多的开销。
仅供参考:这是一个标准的图像卷积(仅支持实现的灰度)。我总是想指出,为了在数学上是正确的,这需要补偿几乎每个默认图像编码中隐含的 gamma compression——但这个小细节经常被忽略。
讨论
这种类型的向量化在 Python 中通常是必需的,具体来说,因为标准 Python 解释器是算法的 extremely inefficient at processing large for
loops. Explicitly iterating over each pixel of an image, therefore, wastes a lot time. Ultimately, though, the vectorized implementation does not change the amount of real work performed, so we're only talking about eliminating an overhead 方面。
然而,矢量化有一个副作用:Parallelization. Lumping a large amount of data processing onto a single operator gives the language/library more flexibility in how to optimize the execution. This might include executing your embarrassingly parallel operation on a GPU -- if you have the right tools, for example the Tensorflow image module。
Python 对 array programming 的无缝支持是它在机器学习中变得非常流行的原因之一,这可能是计算密集型的。
解决方案
这是 imgval
赋值的解决方案,上面留作练习。
imgval = image[hheight:hheight+rows - height+1, wwidth:wwidth+cols - width +1]
您可以构造图像的切片视图数组,每个切片视图移动权重数组的索引,然后将其乘以权重并求和。
def apply_filter(image: np.array, weights: np.array) -> np.array:
height, width = weights.shape
indices = np.indices(weights.shape).T.reshape(weights.size, 2)
views = np.array([image[r:-height+r,c:-width+c] for r, c in indices])
return np.inner(views.T, weights.T.flatten()).T # sum product
(我不得不在几个点上进行转置和重塑以使数据成为所需的形状和顺序。可能有更简单的方法。)
仍然有一个偷偷摸摸的 for
循环,其形式是对权重索引的列表理解,但我们将 for 循环内的操作最小化以创建一组切片视图。使用 sliding_window_view
, but it's not clear if that would improve performance; or stride_tricks.as_strided
(see answers to