CPU 和 GPU 生成的结果不匹配

Question

我正准备将 Numba 与我的 AMD GPU 一起使用。我从他们网站上提供的最基本示例开始，使用蒙特卡洛模拟计算 Pi 的值。

我对代码做了一些更改，以便它可以先在 GPU 上运行，然后再在 CPU 上。通过这样做，我只是想比较执行代码所花费的时间并验证结果。下面是代码：

from numba import jit
import random
from timeit import default_timer as timer

@jit(nopython=True)
def monte_carlo_pi(nsamples):
    acc = 0
    for i in range(nsamples):
        x = random.random()
        y = random.random()
        if (x ** 2 + y ** 2) < 1.0:
            acc += 1
    return 4.0 * acc / nsamples

def monte_carlo_pi_cpu(nsamples):
    acc = 0
    for i in range(nsamples):
        x = random.random()
        y = random.random()
        if (x ** 2 + y ** 2) < 1.0:
            acc += 1
    return 4.0 * acc / nsamples

num = int(input())

start = timer()
random.seed(0)
print(monte_carlo_pi(num))
print("with gpu", timer()-start)

start = timer()
random.seed(0)
print(monte_carlo_pi_cpu(num))
print("without gpu", timer()-start)

我期待 GPU 性能更好，结果确实如此。但是，CPU 和 CPU 的某些结果不匹配。

1000000 # input parameter
3.140836 # gpu_out
with gpu 0.2317520289998356
3.14244 # cpu_out
without gpu 0.39849199899981613

我知道 Python 不能很好地处理长浮点运算，但它们只有 6 位小数，我没想到会有这么大的差异。谁能解释为什么会出现这种差异？

Answer 1

我稍微重组了你的代码：

import numpy 
from numba import jit
import random
from timeit import default_timer as timer

@jit(nopython=True)
def monte_carlo_pi(nsamples):
    random.seed(0)
    acc = 0
    for i in range(nsamples):
        x = random.random()
        y = random.random()
        if (x ** 2 + y ** 2) < 1.0:
            acc += 1
    return 4.0 * acc / nsamples


num = 1000000

# run the jitted code once to remove compile time from timing
monte_carlo_pi(10)

start = timer()
print(monte_carlo_pi(num))
print("jitted code", timer()-start)

start = timer()
print(monte_carlo_pi.py_func(num))
print("non-jitted", timer()-start)

结果：

3.140936
jitted code 0.01403845699996964
3.14244
non-jitted 0.39901430800000526

请注意，您不是运行在您的 GPU 上编译 jitted 代码。代码编译好了，但是为了你的CPU。 Pi 的计算值不同的原因可能是由于底层随机数生成器的不同实现。 Numba 实际上并没有使用 Python 的 random 模块，而是有自己的实现来模仿它。事实上，如果您查看源代码，就会发现 numba 实现似乎主要是基于 numpy 的随机模块设计的，然后只是将 random 模块作为别名，因此如果您换出 random.random 对于 np.random.random，使用相同的种子，您会得到相同的结果：

@jit(nopython=True)
def monte_carlo_pi2(nsamples):
    np.random.seed(0)
    acc = 0
    for i in range(nsamples):
        x = np.random.random()
        y = np.random.random()
        if (x ** 2 + y ** 2) < 1.0:
            acc += 1
    return 4.0 * acc / nsamples

结果：

3.140936
jitted code 0.013946142999998301
3.140936
non-jitted 0.9277294739999888

还有一些其他注意事项：

当对 numba jitted 函数进行计时时，总是运行在进行基准测试之前先编译一次函数，这样您就不会在计时中包含一次性编译时间成本
您可以使用 .py_func 访问纯 python 版本的 numba jitted 函数，因此您不必将代码复制两次。

Answer 2

Q : Can anyone explain as to why this difference comes up?

系统使用通过 PRNG-of-choice
.seed( aRepeatableExperimentSeedNUMBER ) 方法重新设置相同状态的可用性和近乎迂腐的关怀是所有这些意外的根本原因。

当且仅当使用相同的 PRNG 算法时 正确的播种工作 - 在 random 中主要不同 -模块的.random()-方法比numpy.random-模块的.random().

中的方法
另一种观察到的伪影（不同的飞镖投掷值 pi-guessstimates）与相当小的尺度（是的，1E6-points 是一个很小的数量，与统计艺术的初始公理相比 - 这是 "using infinitely and only infinitely sized populations" ），其中 different order 使用已经（由于迂腐和系统地重新 seed(0)-ing PRNG-FSA）可重复生成始终相同的值序列的数字，产生不同的结果（参见昨天实验中的值差异).然而，随着尺寸的增长，这些工件的作用越来越不重要（如最底部所示，可重现的实验）：

# 1E+6: 3.138196 # block-wise generation in np.where().sum() # 3.140936 # pair-wise generation in monte_carlo_pi2() # 1E+7: 3.142726 # block-wise generation in np.where().sum() # 3.142358 # pair-wise generation in monte_carlo_pi2() # 3E+7: 3.1421996 # block-wise generation in np.where().sum() # 3.1416629333333335 # pair-wise generation in monte_carlo_pi2() # 1E+8: 3.14178916 # block-wise generation in np.where().sum() # 3.14167324 # pair-wise generation in monte_carlo_pi2() # 1E+9: -. # block-wise generation in np.where().sum() -x-RAM-SWAP- # 3.141618484 # pair-wise generation in monte_carlo_pi2() # 1E10 -. # block-wise generation in np.where().sum() -x-RAM-SWAP- # 3.1415940572 # pair-wise generation in monte_carlo_pi2() # 1E11 -. # block-wise generation in np.where().sum() -x-RAM-SWAP- # 3.14159550084 # pair-wise generation in monte_carlo_pi2()

接下来，让我展示另一个方面：

这样做的实际成本是多少？这些成本从何而来？！？

一个普通的 pure-numpy 代码是在 localhost 上大约 108 [ms]

>>> from zmq import Stopwatch; clk = Stopwatch() # [us]-clock resolution >>> np.random.seed(0); clk.start();x = np.random.random( 1000000 ); y = np.random.random( 1000000 ); _ = ( np.where( x**2 + y**2 < 1.0, 1, 0 ).sum() * 4.0 / 1000000 );clk.stop() 108444 >>> _ 3.138196

_{这里大部分"costs"与memory-I/O流量有关（用于存储两倍的1E6-元素并使它们平方）"halved"问题有"twice" 和 ~ 52.7 [ms]}
一样快
>>> np.random.seed(0); clk.start(); _ = ( np.where( np.random.random( 1000000 )**2 ... + np.random.random()**2 < 1.0, ... 1, ... 0 ... ).sum() * 4.0 / 1000000 ); clk.stop() 52696

无临时存储 numpy-代码在 localhost 上慢了一点~115 [ms]

>>> np.random.seed(0); clk.start(); _ = ( np.where( np.random.random( 1000000 )**2 ... + np.random.random( 1000000 )**2 < 1.0, ... 1, ... 0 ... ).sum() * 4.0 / 1000000 ); clk.stop(); print _ 114501 3.138196

带有 numpy.random PRNG 生成器的普通 python 代码能够计算出相同的结果，但超过 3,937.9+ [ms]（这里你查看 python 的 for-迭代器的循环痛苦 - 4 秒 与 [ 相比=35=] ) 另外，您可以检测到 PRNG 数字序列的生成方式和成对消耗方式的不同顺序（在结果差异中可见）：

>>> def monte_carlo_pi2(nsamples): ... np.random.seed(0) ... acc = 0 ... for i in range(nsamples): ... if ( np.random.random()**2 ... + np.random.random()**2 ) < 1.0: ... acc += 1 ... return 4.0 * acc / nsamples >>> np.random.seed( 0 ); clk.start(); _ = monte_carlo_pi2( 1000000 ); clk.stop(); print _ 3937892 3.140936

A numba.jit()-编译代码是在大约 692 [ms] 内计算相同的代码，因为它必须承担并承担 cost-of-jit-compilation（仅下一个调用将收获此一站式成本的成果，执行时间约为 ~ 50 [ms] ):

>>> @jit(nopython=True) # COPY/PASTE ... def monte_carlo_pi2(nsamples): ... np.random.seed(0) ... acc = 0 ... for i in range(nsamples): ... x = np.random.random() ... y = np.random.random() ... if (x ** 2 + y ** 2) < 1.0: ... acc += 1 ... return 4.0 * acc / nsamples ... >>> np.random.seed( 0 ); clk.start(); _ = monte_carlo_pi2( 1000000 ); clk.stop(); print _ 692811 3.140936 >>> np.random.seed( 0 ); clk.start(); _ = monte_carlo_pi2( 1000000 ); clk.stop(); print _ 50193 3.140936

结语：

成本很重要。总是。 jit 编译的代码可以帮助 当且仅当 LLVM 编译的代码被如此频繁地重复使用，它可以调整初始编译的成本。

_{（如果奥术大师提出合理的反对意见：使用预编译代码的技巧仍然需要支付费用，不是吗？）}

以及值 ?

仅使用 1E6 个样本并不是很有说服力，无论是对于 pi-dart 投掷实验，还是性能基准测试（确实如此微小的小规模数据样本允许在缓存中引入时序人工制品，这些人工制品不会缩放或无法概括）。规模越大，pi-guessstimate 越接近，执行数据效率计算的效果就越好（流式/成对式将比块式更好（由于数据实例化成本和后来与内存交换相关的窒息）如 online reproducible-experimentation sandbox IDE
所示
# 1E6: # 1E6: 3.138196 Real time: 0.262 s User time: 0.268 s Sys. time: 0.110 s ---------------------------- np.where().sum() block-wise # Real time: 0.231 s User time: 0.237 s Sys. time: 0.111 s # # Real time: 0.251 s User time: 0.265 s Sys. time: 0.103 s ---------------------------- np.where( .reshape().sum() ).sum() block-wise # Real time: 0.241 s User time: 0.234 s Sys. time: 0.124 s # # 3.140936 Real time: 1.567 s User time: 1.575 s Sys. time: 0.097 s ---------------------------- monte_carlo_pi2() -- -- -- -- -- -- pair-wise # Real time: 1.556 s User time: 1.557 s Sys. time: 0.102 s # # 1E7: # 1E7: 3.142726 Real time: 0.971 s User time: 0.719 s Sys. time: 0.327 s ---------------------------- np.where().sum() block-wise # Real time: 0.762 s User time: 0.603 s Sys. time: 0.271 s # # Real time: 0.827 s User time: 0.604 s Sys. time: 0.335 s ---------------------------- np.where( .reshape().sum() ).sum() block-wise # Real time: 0.767 s User time: 0.590 s Sys. time: 0.288 s # # 3.142358 Real time: 14.756 s User time: 14.619 s Sys. time: 0.103 s ---------------------------- monte_carlo_pi2() -- -- -- -- -- -- pair-wise # Real time: 14.879 s User time: 14.740 s Sys. time: 0.117 s # # 3E7: # 3E7: 3.1421996 Real time: 1.914 s User time: 1.370 s Sys. time: 0.645 s ---------------------------- np.where().sum() block-wise # Real time: 1.796 s User time: 1.380 s Sys. time: 0.516 s # # Real time: 2.325 s User time: 1.615 s Sys. time: 0.795 s ---------------------------- np.where( .reshape().sum() ).sum() block-wise # Real time: 2.099 s User time: 1.514 s Sys. time: 0.677 s # # 3.1416629333333335 Real time: 50.182 s User time: 49.680 s Sys. time: 0.107 s ---------------------------- monte_carlo_pi2() -- -- -- -- -- -- pair-wise # Real time: 47.240 s User time: 46.711 s Sys. time: 0.103 s # # 1E8: # 1E8: 3.14178916 Real time: 12.970 s User time: 5.296 s Sys. time: 7.273 s ---------------------------- np.where().sum() block-wise # Real time: 8.275 s User time: 6.088 s Sys. time: 2.172 s

而且我们没有谈论最终的性能优势 - 查看有关 cython 的文章，其中有一个选项可以利用 OpenMP 代码作为 python 的下一剂性能提升类固醇

CPU 和 GPU 生成的结果不匹配

Mismatch in results generated by CPU and GPU

python

parallel-processing

performance

gpu

numba

这样做的实际成本是多少？这些成本从何而来？！？

结语：

以及值 ?