Numba is 10x slower than the Python equivalent on a task it should be good at
I have the following function:
from math import atan, sqrt

import numpy as np


def dewarp(image, destination_image, pixels, strength, zoom, pts, players):
    height = image.shape[0]
    width = image.shape[1]
    half_height = height / 2
    half_width = width / 2
    pts_transformed = np.empty((0, 2))
    players_transformed = np.empty((0, 2))
    correctionRadius = sqrt(width ** 2 + height ** 2) / strength
    for x_p, y_p in pixels:
        newX = x_p - half_width
        newY = y_p - half_height
        distance = sqrt(newX ** 2 + newY ** 2)
        r = distance / correctionRadius
        if r == 0:
            theta = 1
        else:
            theta = atan(r) / r
        sourceX = int(half_width + theta * newX * zoom)
        sourceY = int(half_height + theta * newY * zoom)
        if 0 < sourceX < width and 0 < sourceY < height:
            destination_image[y_p, x_p, :] = image[sourceY, sourceX, :]
            if (sourceX, sourceY) in pts:
                pts_transformed = np.vstack((pts_transformed, np.array([[x_p, y_p]])))
            if (sourceX, sourceY) in players:
                players_transformed = np.vstack((players_transformed, np.array([[x_p, y_p]])))
    return destination_image, pts_transformed, players_transformed
The arguments are:
image and destination_image: both 3840x800x3 numpy arrays
pixels is a list of pixel coordinate pairs; I also tried a double for loop instead, but the result is the same
strength and zoom are both floats
pts and players are both Python sets
This pure Python version takes about 4 seconds, while the Numba version usually takes around 30 seconds. How is that possible?
I have used dewarp.inspect_types and Numba does not appear to be in object mode.
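The Numba-compiled version is not shown above; a minimal sketch of how it was presumably created (the njit call is an assumption, not taken from the question):

from numba import njit

dewarp_numba = njit(dewarp)  # assumption: the same function, compiled in nopython mode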
For convenience, if you want to recreate the example, you can use the following as image, destination_image, pts and players and check it yourself:
pts = {(70, 667),
(70, 668),
(71, 667),
(71, 668),
(1169, 94),
(1169, 95),
(1170, 94),
(1170, 95),
(2699, 86),
(2699, 87),
(2700, 86),
(2700, 87),
(3794, 641),
(3794, 642),
(3795, 641),
(3795, 642)}
players = {(1092, 257),
(1092, 258),
(1093, 257),
(1093, 258),
(1112, 252),
(1112, 253),
(1113, 252),
(1113, 253),
(1155, 167),
(1155, 168),
(1156, 167),
(1156, 168),
(1158, 357),
(1158, 358),
(1159, 357),
(1159, 358),
(1246, 171),
(1246, 172),
(1247, 171),
(1247, 172),
(1260, 257),
(1260, 258),
(1261, 257),
(1261, 258),
(1280, 273),
(1280, 274),
(1281, 273),
(1281, 274),
(1356, 410),
(1356, 411),
(1357, 410),
(1357, 411),
(1385, 158),
(1385, 159),
(1386, 158),
(1386, 159),
(1406, 199),
(1406, 200),
(1407, 199),
(1407, 200),
(1516, 481),
(1516, 482),
(1517, 481),
(1517, 482),
(1639, 297),
(1639, 298),
(1640, 297),
(1640, 298),
(1806, 148),
(1806, 149),
(1807, 148),
(1807, 149),
(1807, 192),
(1807, 193),
(1808, 192),
(1808, 193),
(1834, 285),
(1834, 286),
(1835, 285),
(1835, 286),
(1875, 199),
(1875, 200),
(1876, 199),
(1876, 200),
(1981, 206),
(1981, 207),
(1982, 206),
(1982, 207),
(1990, 326),
(1990, 327),
(1991, 326),
(1991, 327),
(2021, 355),
(2021, 356),
(2022, 355),
(2022, 356),
(2026, 271),
(2026, 272),
(2027, 271),
(2027, 272)}
image = np.zeros((800, 3840, 3))
destination_image = np.zeros((800, 3840, 3))
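pixels, strength and zoom are not included above; one plausible way to define them for this image size (the values below are placeholders, not taken from the question):

pixels = [(x, y) for x in range(3840) for y in range(800)]  # every destination pixel of the 3840x800 image
strength = 2.0  # placeholder value
zoom = 1.0      # placeholder value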
Am I missing something? Is this just something Numba cannot do? Should I write it differently? Thanks!
The line profiler shows that a lot, but not most, of the time is spent in NumPy, so there should be room for improvement, right?
I don't see why this algorithm would get any significant benefit from Numba. All the heavy lifting appears to be in the image copying and the np.vstack parts. That is all done in NumPy, so Numba won't help there. The way you use vstack iteratively also has terrible performance; you would be better off building a list of sub-arrays and stacking them all at once at the end, as sketched below.
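A minimal sketch of that pattern (collect the matching rows in a plain Python list and build the array once at the end), with placeholder data and a placeholder condition:

import numpy as np

def collect_matches(pairs):
    rows = []                       # appending to a list is cheap
    for x_p, y_p in pairs:
        if x_p % 2 == 0:            # placeholder for the real membership test
            rows.append((x_p, y_p))
    # build the array once, instead of calling np.vstack on every iteration
    return np.array(rows, dtype=float).reshape(-1, 2)

print(collect_matches([(1, 5), (2, 6), (4, 7)]))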
What does dewarp.inspect_types() output? It should show you where Numba needs to interface with Python. If that is done anywhere inside the loop, performance will suffer if your program is multithreaded.
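A minimal, self-contained illustration of how inspect_types is typically used (the toy function is only an example):

from numba import njit

@njit
def add(a, b):
    return a + b

add(1, 2)            # trigger compilation for (int64, int64)
add.inspect_types()  # prints the typed IR; 'pyobject' entries indicate interaction with Python objects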
Whether or not you use Numba, you should avoid growing an array incrementally inside a loop: it performs very badly. Preallocate an array and fill it element by element instead (since you may not know the exact size in advance, you can preallocate it at the maximum possible size, e.g. len(pixels), and slice off the unused space at the end); a short sketch of that pattern follows. That said, your code can be vectorized in a more or less direct way, as shown after the sketch.
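A minimal sketch of the preallocate-and-slice pattern just described (placeholder data and condition):

import numpy as np

pixels = [(1, 5), (2, 6), (3, 7), (4, 8)]     # placeholder pixel list
pts_transformed = np.empty((len(pixels), 2))  # preallocate at the maximum possible size
count = 0
for x_p, y_p in pixels:
    if x_p % 2 == 0:                          # placeholder for the real membership test
        pts_transformed[count] = (x_p, y_p)
        count += 1
pts_transformed = pts_transformed[:count]     # slice off the unused space at the end
print(pts_transformed)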
import numpy as np

def dewarp_vec(image, destination_image, pixels, strength, zoom, pts, players):
    height = image.shape[0]
    width = image.shape[1]
    half_height = height / 2
    half_width = width / 2
    correctionRadius = np.sqrt(width ** 2 + height ** 2) / strength
    x_p, y_p = np.asarray(pixels).T
    newX = x_p - half_width
    newY = y_p - half_height
    distance = np.sqrt(newX ** 2 + newY ** 2)
    r = distance / correctionRadius
    theta = np.arctan(r) / r
    theta[r == 0] = 1
    sourceX = (half_width + theta * newX * zoom).astype(np.int32)
    sourceY = (half_height + theta * newY * zoom).astype(np.int32)
    # keep only pixels whose source position falls inside the image
    m1 = (0 < sourceX) & (sourceX < width) & (0 < sourceY) & (sourceY < height)
    x_p, y_p, sourceX, sourceY = x_p[m1], y_p[m1], sourceX[m1], sourceY[m1]
    destination_image[y_p, x_p, :] = image[sourceY, sourceX, :]
    # "flat" pixel indices so coordinate pairs can be compared with np.isin
    source_flat = sourceY * width + sourceX
    pts_x, pts_y = np.asarray(list(pts)).T
    pts_flat = pts_y * width + pts_x
    players_x, players_y = np.asarray(list(players)).T
    players_flat = players_y * width + players_x
    m_pts = np.isin(source_flat, pts_flat)
    m_players = np.isin(source_flat, players_flat)
    pts_transformed = np.stack([x_p[m_pts], y_p[m_pts]], axis=1)
    players_transformed = np.stack([x_p[m_players], y_p[m_players]], axis=1)
    return destination_image, pts_transformed, players_transformed
The part that differs from your code is how it checks whether (sourceX, sourceY) is in pts and players. For that, I compute "flat" pixel indices and use np.isin instead (if you know there are no duplicate coordinate pairs in either input, you can add assume_unique=True).
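A small usage sketch, assuming the sample image, destination_image, pts and players from the question and a pixels list like the one sketched earlier are already defined; the strength and zoom values below are placeholders, since the question does not give them:

dst, pts_t, players_t = dewarp_vec(image, destination_image, pixels, 2.0, 1.0, pts, players)
print(pts_t.shape, players_t.shape)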