Cython seems to provide speed-up by reducing the overhead in the time profiler rather than in the core code?

I am trying to learn and use Cython to speed up my personal projects, but I have noticed something strange.

Example:

I tried the rbf_network example from http://nealhughes.net/cython1/:

import pyximport; pyximport.install()
from src.test_cython import rbf_network  # File1: /src/test_cython.pyx
# from src.test import rbf_network       # File2: /src/test.py
import time
import cProfile
import numpy as np

def fun():
    D = 5
    N = 1000
    X = np.array([np.random.rand(N) for d in range(D)]).T
    beta = np.random.rand(N)
    theta = 10
    rbf_network(X, beta, theta)

# With cProfile
cProfile.run('fun()', sort='cumtime')

# Without cProfile
start = time.time()
fun()
print("Time without CProfile: ", time.time() - start)

File1 and File2 both contain:

from math import exp
import numpy as np

def rbf_network(X, beta, theta):

    N = X.shape[0]
    D = X.shape[1]
    Y = np.zeros(N)

    for i in range(N):
        for j in range(N):
            r = 0
            for d in range(D):
                r += (X[j, d] - X[i, d]) ** 2
            r = r**0.5
            Y[i] += beta[j] * exp(-(r * theta)**2)

    return Y

Output with File1 (cythonized):

     13 function calls in 3.920 seconds

Ordered by: cumulative time

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    1    0.000    0.000    3.920    3.920 {built-in method builtins.exec}
    1    0.000    0.000    3.920    3.920 <string>:1(<module>)
    1    0.000    0.000    3.920    3.920 run.py:138(fun)
    1    3.920    3.920    3.920    3.920 {src.test_cython.rbf_network}
    1    0.000    0.000    0.000    0.000 run.py:141(<listcomp>)
    6    0.000    0.000    0.000    0.000 {method 'rand' of 'mtrand.RandomState' objects}
    1    0.000    0.000    0.000    0.000 {built-in method numpy.core.multiarray.array}
    1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}


Time without CProfile:  3.899562120437622

Output with File2 (not cythonized):

         1000014 function calls in 13.193 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    1    0.000    0.000   13.193   13.193 {built-in method builtins.exec}
    1    0.000    0.000   13.193   13.193 <string>:1(<module>)
    1    0.000    0.000   13.193   13.193 run.py:138(fun)
    1    7.948    7.948   13.193   13.193 test.py:4(rbf_network)
  1000000    5.245    0.000    5.245    0.000 {built-in method math.exp}
    1    0.000    0.000    0.000    0.000 run.py:141(<listcomp>)
    6    0.000    0.000    0.000    0.000 {method 'rand' of 'mtrand.RandomState' objects}
    1    0.000    0.000    0.000    0.000 {built-in method numpy.core.multiarray.array}
    1    0.000    0.000    0.000    0.000 {built-in method numpy.core.multiarray.zeros}
    1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}


Time without CProfile:  4.139716863632202

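In short, when measured with cProfile, the cythonized code appears to improve from 13.193 s to 3.920 s, but when measured with the built-in time module, the real improvement is only from about 4.14 s to 3.90 s.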
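To make the gap concrete, here is a small sketch that only does arithmetic on the timings printed in the two runs above. The variable names are made up for illustration, and the profiled and unprofiled runs used different random inputs of the same size, so the figures are approximate.

```python
# Timings copied from the two runs above, in seconds.
profiled_python, profiled_cython = 13.193, 3.920
wall_python, wall_cython = 4.139716863632202, 3.899562120437622

# Speed-up as it appears under cProfile vs. as measured with time.time().
profiled_speedup = profiled_python / profiled_cython
wall_speedup = wall_python / wall_cython

# Extra time that running under cProfile added to each version.
overhead_python = profiled_python - wall_python
overhead_cython = profiled_cython - wall_cython

print(f"apparent speed-up under cProfile: {profiled_speedup:.2f}x")
print(f"speed-up measured with time.time(): {wall_speedup:.2f}x")
print(f"cProfile overhead, pure Python: {overhead_python:.3f} s")
print(f"cProfile overhead, cythonized:  {overhead_cython:.3f} s")
```

Almost all of the apparent gain is the difference between the two overhead figures, not a difference in the code being measured.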

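Cython does provide some speed-up (even when used naively), but measuring that speed-up with a profiler seems to exaggerate the result. Perhaps it is the profiler, rather than the core code, that benefits from using Cython. Is this true, or am I doing something wrong?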

Edit: Also, I am not sure why cProfile does not track {built-in method math.exp} in the cythonized code.

The Python profile module docs address this directly:

Note The profiler modules are designed to provide an execution profile for a given program, not for benchmarking purposes (for that, there is timeit for reasonably accurate results). This particularly applies to benchmarking Python code against C code: the profilers introduce overhead for Python code, but not for C-level functions, and so the C code would seem faster than any Python one.
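
The effect described in the note can be reproduced without Cython. The sketch below is only an illustration: `many_calls`, `time_plain` and `time_profiled` are hypothetical helper names, and the loop size is arbitrary. It times the same pure-Python function, which makes many `math.exp` calls, once with `timeit` and once while running under `cProfile`.

```python
import cProfile
import time
import timeit
from math import exp


def many_calls(n=200_000):
    """Pure-Python loop that makes n calls to math.exp."""
    total = 0.0
    for i in range(n):
        total += exp(-i * 1e-6)
    return total


def time_plain(repeat=3):
    """Best wall-clock time of many_calls() with no profiler attached."""
    # timeit.repeat returns one total per repetition; keep the minimum.
    return min(timeit.repeat(many_calls, number=1, repeat=repeat))


def time_profiled(repeat=3):
    """Best wall-clock time of many_calls() while cProfile is recording."""
    best = float("inf")
    for _ in range(repeat):
        profiler = cProfile.Profile()
        start = time.perf_counter()
        profiler.runcall(many_calls)  # every exp() call is recorded here
        best = min(best, time.perf_counter() - start)
    return best


if __name__ == "__main__":
    print(f"plain:    {time_plain():.4f} s")
    print(f"profiled: {time_profiled():.4f} s")
```

On a typical CPython build the profiled figure should come out noticeably larger, because the profiler does bookkeeping for each of the calls it records. In the cythonized module those calls most likely happen inside compiled C code where cProfile does not see them, which would also explain why `math.exp` is missing from the first profile. For comparing the two versions of `rbf_network`, the `timeit` figure is the one to trust.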