对（2-d）点距离的不同方法的运行时间感到困惑

Question

最近我正在 python (3.7) 开发一款塔防游戏。在这个游戏的代码中，我必须检查很多不同二维点之间的距离。所以我想看看完成这个任务最快的方法是什么

具体的规范我不是很确定。 2 范数似乎是自然选择，但我不反对更改为无穷大范数，如果执行时间需要的话 (https://en.wikipedia.org/wiki/Lp_space)。

我的印象是，列表理解比 for 循环更快，而且内置函数（例如在 numpy 中）也比我能写的大多数代码都快。我测试了一些计算距离的替代方案，但很惊讶我无法证实这种印象。令我惊讶的是，numpy-linalg.norm-函数的表现非常糟糕。这是我用来测试运行时的代码：

import timeit
import numpy as np
import test_extension

number_of_repeats = 10000

tower_pos = (3,5)
enemy_pos = ((1,2),(2,3),(3,4),(4,5),(5,6),(6,7),(7,8),(8,9))
distance = 3
options = ('np.linalg 2 norm', 'np.linalg inf norm', 'inf norm manuel', 'manuel squared 2 norm', 'hard coded squared 2 norm', 'c-extension hard coded squared 2 norm', 'all in extension')

def list_comprehension(option):
    if option == 0:
        return [pos for pos in enemy_pos if np.linalg.norm(np.subtract(tower_pos, pos)) <= distance]
    elif option == 1:
        return [pos for pos in enemy_pos if np.linalg.norm(np.subtract(tower_pos, pos), ord = 1) <= distance]
    elif option == 2:
        return [pos for pos in enemy_pos if max(abs(np.subtract(tower_pos, pos))) <= distance]
    elif option == 3:
        return [pos for pos in enemy_pos if sum(np.square(np.subtract(tower_pos, pos))) <= distance**2]
    elif option == 4:
        return [pos for pos in enemy_pos for distance_vector in [np.subtract(tower_pos, pos)] if distance_vector[0]*distance_vector[0]+distance_vector[1]*distance_vector[1] <= distance**2 ]
    elif option == 5:
        return [pos for pos in enemy_pos if test_extension.distance(np.subtract(tower_pos, pos)) <= distance**2]
    elif option == 6:
        return test_extension.list_comprehension(tower_pos, enemy_pos, distance)

def for_loop(option):
    l = []
    if option == 0:
        for pos in enemy_pos:
            distance_vector = np.subtract(tower_pos, pos)
            norm = np.linalg.norm(distance_vector)
            if norm <= distance:
                l.append(pos)
    elif option == 1:
        for pos in enemy_pos:
            distance_vector = np.subtract(tower_pos, pos)
            norm = np.linalg.norm(distance_vector, ord = np.inf)
            if norm <= distance:
                l.append(pos)
    elif option == 2:
        for pos in enemy_pos:
            distance_vector = np.subtract(tower_pos, pos)
            norm = max(abs(distance_vector))
            if norm <= distance:
                l.append(pos)
    elif option == 3:
        for pos in enemy_pos:
            distance_vector = np.subtract(tower_pos, pos)
            norm = sum(distance_vector * distance_vector)
            if norm <= distance**2:
                l.append(pos)
    elif option == 4:
        for pos in enemy_pos:
            distance_vector = np.subtract(tower_pos, pos)
            norm = distance_vector[0]*distance_vector[0]+distance_vector[1]*distance_vector[1]
            if norm <= distance**2:
                l.append(pos)
    elif option == 5:
        d = test_extension.distance
        for pos in enemy_pos:
            distance_vector = np.subtract(tower_pos, pos)
            norm = d(distance_vector)
            if norm <= distance**2:
                l.append(pos)
    elif option == 6:
        return test_extension.for_loop(tower_pos, enemy_pos, distance)
    return l 

print(f"{'_'.ljust(40)}list_comprehension   for_loop")
for i,option in enumerate(options):
    times = [timeit.timeit(f'{method}({i})', number = number_of_repeats, globals = {f'{method}': globals()[f'{method}']}) for method in ['list_comprehension', 'for_loop']]
    s = f'{option}:'.ljust(40)
    print(f"{s}{round(times[0],5)}{' '*15}  {round(times[1],5)}")

给出了以下输出：

                                        list_comprehension   for_loop
np.linalg 2 norm:                       3.58053                 3.56676
np.linalg inf norm:                     3.17169                 3.53372
inf norm manuel:                        1.15261                 1.1951
manuel squared 2 norm:                  1.30239                 1.28485
hard coded squared 2 norm:              1.11324                 1.08586
c-extension hard coded squared 2 norm:  1.01506                 1.05114
all in extension:                       0.81358                 0.81262

test_extension 包含以下代码。然后我使用包 cython (https://cython.readthedocs.io/en/latest/src/tutorial/cython_tutorial.html):

从 test_extension 创建了一个 c-extension

import numpy as np

def distance(vector: np.array):
    return vector[0]*vector[0]+vector[1]*vector[1]

def for_loop(single_pos, pos_list, distance):
    l = []
    for pos in pos_list:
        distance_vector = np.subtract(single_pos, pos)
        norm = distance_vector[0]*distance_vector[0]+distance_vector[1]*distance_vector[1]
        if norm <= distance**2:
            l.append(pos)
    return l

def list_comprehension(single_pos, pos_list, distance):
    return [pos for pos in pos_list for distance_vector in [np.subtract(single_pos, pos)] if distance_vector[0]*distance_vector[0]+distance_vector[1]*distance_vector[1] <= distance**2 ]

目前我主要有以下3个问题，但请随时提供您可以分享的其他见解：

为什么 np.linalg.norms 中的构建如此缓慢？我用错了吗？
为什么硬编码向量 v 的平方 2 范数比使用 sum(np.square(v)) 更好？
我是否错过了更快的距离计算方法（2 范数）？

Answer 1

对于游戏开发，当您需要频繁查询附近有什么物体时，您可能需要实现 spatial partitioning。

此答案的其余部分将处理观察到的 numpy 行为。首先，列表推导式 marginally faster 而非循环（假设您正在构建一个您想要保留的可迭代对象）。

现在，让我们设置一个函数来收集时间统计信息。我们还将创建一些我们经常使用的数组，我们需要排除它们的创建时间。

def timer(func):
    times = timeit.repeat(func, number=number_of_repeats, repeat=10)
    return f'{np.mean(times):.5f} ({np.std(times):.5f})'

enemy_numpy = np.array(enemy_pos)
tower_numpy = np.array(tower_pos)

enemy_pos_large = enemy_pos * 20
enemy_numpy_large = np.array(enemy_pos_large)

Python 级循环比 numpy 操作慢。但是，大多数 numpy 操作都会创建一个新数组，其分配会占用很长时间。最重要的是，函数调用有其自身的开销。与仅硬编码几个索引获取器和操作相比，所有这些前期成本可能占主导地位，即使在缓慢的 python 级别也是如此。

np.subtract(enemy_pos[i] - tower_pos) 创建两个 2 元素数组，然后为结果创建一个最终的 2 元素数组。分配时间支配着所涉及的少量操作。

>>> timer(lambda: np.subtract(enemy_pos[0], tower_pos))
'0.03206 (0.00031)'
>>> # initial array allocation
>>> timer(lambda: (np.array(enemy_pos[0]), np.array(tower_pos)))
'0.02101 (0.00014)'
>>> # straight up numpy subtraction of one enemy position
>>> timer(lambda: enemy_numpy[0] - tower_numpy)
'0.00636 (0.00019)'
>>> # pure python with creating tuples
>>> timer(lambda: (enemy_pos[0][0] - tower_pos[0], enemy_pos[0][1] - tower_pos[1]))
'0.00145 (0.00006)'

对于更大的数组，元组到数组的转换仍然占主导地位，但输出的分配和 numpy 操作的速度很出色。

>>> # subtract all positions at once with `np.subtract`
>>> timer(lambda: np.subtract(enemy_pos, tower_pos))
'0.09057 (0.00029)'
>>> # same for the bigger array; 20x bigger, but only 10x slower
>>> timer(lambda: np.subtract(enemy_pos_large, tower_pos))
'1.07190 (0.00133)'
>>> # pure python beats numpy by an order of magnitude for the small list
>>> timer(lambda: [(ex - tower_pos[0], ey - tower_pos[1]) for ex, ey in enemy_pos])
'0.00918 (0.00007)'
>>> # pure python for the big list still wins, 20x bigger, 16x slower
>>> timer(lambda: [(ex - tower_pos[0], ey - tower_pos[1]) for ex, ey in enemy_pos_large])
'0.14962 (0.00055)'
>>> # preallocated small numpy arrays are barely edged out by pure python
>>> timer(lambda: enemy_numpy - tower_numpy)
'0.01140 (0.00014)'
>>> # preallocated big numpy arrays are outperforming pure python; 20x bigger, 2x slower
>>> timer(lambda: enemy_numpy_large - tower_numpy)
'0.01817 (0.00028)'

vector[0]*vector[0] + vector[1]*vector[1] 对两个整数进行平方，然后将它们相加。它避免了为平方和求和创建新的数组，因此 python 级的慢操作排挤了 numpy。但是对于更大的数组，情况会有所不同。

>>> v2 = np.arange(2)
>>> timer(lambda: v2[0]**2 + v2[1]**2)
'0.00741 (0.00020)'
>>> timer(lambda: np.add.reduce(v2**2))
'0.01756 (0.00027)'
>>> v6 = np.arange(6)
>>> # too many slow python operations
>>> timer(lambda: v6[0]**2 + v6[1]**2 + v6[2]**2 + v6[3]**2 + v6[4]**2 + v6[5]**2)
'0.02321 (0.00028)'
>>> # better off creating the intermediate arrays and letting numpy do its thing
>>> timer(lambda: np.add.reduce(v6**2))
'0.01744 (0.00016)'

底线：将 python 可迭代对象解析为 numpy 数组的代价很高。 Numpy 喜欢一个方法调用一次作用于整个数据。如果你只能初始化你的数组一次然后使用，那么除了非常小的尺寸之外，中间输出数组将无关紧要，你将获得快速 numpy 操作的优势。

现在，让我们编写使用列表数组版本的方法的 numpy 版本，看看与 python 循环相比它们的性能如何。我们还将其与从未调用过 numpy 函数的纯 python 函数进行比较。请注意，一些选项被跳过，因为它们的代码与以前的方法相同。

def numpified(option):
    if option == 0:
        # we can improve this timing if the arrays are float by default,
        # as `np.linalg.norm` will convert them otherwise
        return enemy_numpy[np.linalg.norm(tower_numpy - enemy_numpy, axis=1) <= distance]
    elif option == 1:
        return enemy_numpy[np.linalg.norm(tower_numpy - enemy_numpy, axis=1, ord=np.inf) <= distance]
    elif option == 2:
        return enemy_numpy[np.max(np.abs(tower_numpy - enemy_numpy), axis=1) <= distance]
    elif option == 3:
        return enemy_numpy[np.sum((tower_numpy - enemy_numpy)**2, axis=1) <= distance**2]
    elif option == 4:
        # same as option == 3
        return []
    elif option == 5:
        temp = (tower_numpy - enemy_numpy)**2
        return enemy_numpy[test_extension.distance2d(tower_numpy - enemy_numpy) <= distance**2]
    elif option == 6:
        # same as option == 5
        return []

def pure(option):
    l = []
    tx, ty = tower_pos
    if option == 0:
        for ex, ey in enemy_pos:
            x = tx - ex
            y = ey - ey
            # since we're dealing with real numbers, |x|**2 is just x**2.
            # We can also skip taking the square root by comparing against `distance**2`
            if x*x + y*y <= distance**2:
                l.append((ex, ey))
    elif option == 1:
        for ex, ey in enemy_pos:
            if max((abs(tx - ex), abs(ty - ey))) <= distance:
                l.append((ex, ey))
    else:
        # the rest are the same to the above
        return []

methods = [list_comprehension, for_loop, numpified, pure]
print('{:40s}'.format('_') + ''.join(f'{m.__name__:20s}' for m in methods))
for i, option in enumerate(options):
    times = ''.join(f'{timer(lambda: m(i)):20s}' for m in methods)
    print(f'{option:40s}{times}')

在test_extension模块中还要添加如下函数

def distance2d(vector: np.array):
    return vector[:,0]*vector[:,0]+vector[:,1]*vector[:,1]

时间

_                                       list_comprehension  for_loop            numpified           pure                
np.linalg 2 norm                        0.61583 (0.00289)   0.61282 (0.00278)   0.12935 (0.00600)   0.02203 (0.00016)   
np.linalg inf norm                      0.68147 (0.00486)   0.68253 (0.00515)   0.09752 (0.00338)   0.02001 (0.00009)   
inf norm manuel                         0.38958 (0.00235)   0.38934 (0.00275)   0.07727 (0.00096)   0.00151 (0.00002)   
manuel squared 2 norm                   0.43648 (0.00114)   0.42091 (0.00053)   0.08799 (0.00165)   0.00152 (0.00001)   
hard coded squared 2 norm               0.34672 (0.00061)   0.34031 (0.00067)   0.00162 (0.00002)   0.00152 (0.00001)   
c-extension hard coded squared 2 norm   0.34750 (0.00097)   0.34609 (0.00089)   0.09728 (0.00251)   0.00152 (0.00002)   
all in extension                        0.34823 (0.00082)   0.34101 (0.00087)   0.00184 (0.00001)   0.00154 (0.00001)

我们可以看到 pure python 实际上做得更好。现在让我们尝试使用更大的数组，只比较 numpified 和 pure。在运行定时循环之前再次设置以下内容。

enemy_pos = enemy_pos_large
enemy_numpy = enemy_numpy_large
methods = [numpified, pure]

时间

_                                       numpified           pure                
np.linalg 2 norm                        0.13939 (0.00460)   0.41680 (0.00130)   
np.linalg inf norm                      0.12642 (0.00134)   0.38518 (0.00072)   
inf norm manuel                         0.10712 (0.00107)   0.00155 (0.00004)   
manuel squared 2 norm                   0.12589 (0.00057)   0.00156 (0.00004)   
hard coded squared 2 norm               0.00165 (0.00007)   0.00153 (0.00002)   
c-extension hard coded squared 2 norm   0.12898 (0.00288)   0.00154 (0.00002)   
all in extension                        0.00199 (0.00005)   0.00166 (0.00004)

您可能可以通过使用 numba 来提高性能，但在研究之前，请考虑 空间分区。

对（2-d）点距离的不同方法的运行时间感到困惑

confused about runtime of differents methods for distance of (2-d) points

python

numpy

execution-time

python-3.x

对于游戏开发，当您需要频繁查询附近有什么物体时，您可能需要实现 spatial partitioning。