如何在cython中使用prange？

Question

我正在尝试使用 Monte Carlo 方法求解 2D-Ising 模型。

因为速度慢我用Cython来加速代码执行。我想进一步推动它并并行化 Cython 代码。我的想法是将 2D 格子一分为二，因此对于一个格子上的任何点，它在另一个格子上都有最近的邻居。这样我就可以随机选择一个格子，我可以翻转所有的自旋，这可以并行完成，因为所有这些自旋都是独立的。

到目前为止，这是我的代码：
_{（灵感来自 http://jakevdp.github.io/blog/2017/12/11/live-coding-cython-ising-model/）}

%load_ext Cython
%%cython 
cimport cython
cimport numpy as np
import numpy as np
from cython.parallel cimport prange

@cython.boundscheck(False)
@cython.wraparound(False)

def cy_ising_step(np.int64_t[:, :] field,float beta):

    cdef int N = field.shape[0]
    cdef int M = field.shape[1]


    cdef int offset = np.random.randint(0,2)

    cdef np.int64_t[:,] n_update = np.arange(offset,N,2,dtype=np.int64)

    cdef int m,n,i,j

    for m in prange(M,nogil=True):
        i = m % 2
        for j in range(n_update.shape[0]) :
            n = n_update[j]

            cy_spin_flip(field,(n+i) %N,m%M,beta)

    return np.array(field,dtype=np.int64)


cdef cy_spin_flip(np.int64_t[:, :] field,int n,int m, float beta=0.4,float J=1.0):

    cdef int N = field.shape[0]
    cdef int M = field.shape[1]

    cdef float dE = 2*J*field[n,m]*(field[(n-1)%N,m]+field[(n+1)%N,m]+field[n,(m-1)%M]+field[n,(m+1)%M])

    if dE <= 0 :
        field[n,m] *= -1

    elif np.exp(-dE * beta) > np.random.rand():
        field[n,m] *= -1

我尝试使用 prange-构造函数，但我在使用 GIL 锁时遇到了很多麻烦。我是 Cython 和并行计算的新手，所以我很容易错过一些东西。

错误：

Discarding owned Python object not allowed without gil
Calling gil-requiring function not allowed without gil

Answer 1

从 Cython 的角度来看，主要问题是 cy_spin_flip 需要 GIL。您需要将 nogil 添加到其签名的末尾，并将 return 类型设置为 void （因为默认情况下它 return 是一个 Python 对象，它需要 GIL）。

但是，np.exp 和 np.random.rand 也需要 GIL，因为它们是 Python 函数调用。 np.exp 可能很容易被 libc.math.exp 代替。 np.random 有点难，但是对于基于 C 和 C++ 的方法有很多建议：1 2 （+ 其他）。

一个更根本的问题是这条线：

cdef float dE = 2*J*field[n,m]*(field[(n-1)%N,m]+field[(n+1)%N,m]+field[n,(m-1)%M]+field[n,(m+1)%M])

您已将此与 m 并行化（即 m 的不同值在不同线程中为运行），并且每次迭代都会更改 field。然而，在这一行中，您正在查找 m 的几个不同值。这意味着整个事情是一个竞争条件（结果取决于不同线程完成的顺序）并表明你的算法可能根本不适合并行化。或者你应该复制 field 并得到 field_in 和 field_out。这对我来说并不明显，但这是你应该能够解决的问题。

编辑： 看起来您确实在使用 i%2 时考虑了竞争条件。对我来说这并不明显，但这是正确的。我认为您的 "alternate cells" 方案的有效实施类似于：

for oddeven in range(2):
    for m in prange(M):
        for n in range(N):
            # some mechanism to pick the alternate cells here.

即您需要一个常规循环来选择并行循环之外的备用单元格。

Answer 2

Q : "How to use prange in cython?"^{. . . . + ( an Epilogue on True-[PARALLEL] True-randomness ... )}

简短版本： 最好的地方是性能提升。

更长的版本：
你的问题不是从避免 GIL 锁所有权开始，而是从几乎计算反模式的物理和性能损失开始，不管 cython-isation 可能启用的所有功能。

代码原样尝试在 {-1|+1}-自旋-[= 的整个二维域上应用 2D-kernel 运算符124=]field[N,M]，以快速和聪明的方式最好。

实际结果与 PHYSICAL FIELD ISING 不一致，因为“破坏性”-自我重写 field[n_,m] 权利的实际状态的技术“在“当前一代 [PAR][SEQ]-组织覆盖当前自旋值 field[:,:] 的二维域依次修改 field[i,j] 的状态，这显然不会发生在真实的公认的物理定律的世界。计算机不了解这些规则，我们人类应该不了解这些规则。

接下来，prange 尝试调用 ( M * N / 2 ) 次 cdef -ed cy_spin_flip() 在某种程度上，这可能很容易编码，但效率非常低，如果不是性能反模式测试，那么运行这种方式。

如果一个基准测试的调用成本约为 1E6-调用修复，以便与物理定律一致，cy_spin_flip()功能，人们直接看到每次调用开销的成本开始很重要，当以 prange-d 方式传递它们时（隔离，不协调，内存布局不可知的，几乎是原子的 memory-I/O 会破坏任何缓存/缓存行的一致性）。这是进入天真的 prange 的额外成本，而不是尝试进行一些 矢量化 / 块优化，memory-I/O 更智能的矩阵 / 内核处理。

使用二维核卷积的矢量化代码：

使用向量化大师@Divakar 提出的技巧，快速绘制的向量化代码可以在没有 CPU 的情况下每 ~ 3k3 [us] 生成一个步骤-架构调整和进一步调整 spin_2Dstate[200,200] :

初始状态是：

spin_2Dstate = np.random.randint( 2, size = N * M, dtype = np.int8 ).reshape( N, M ) * 2 - 1
# pre-allocate a memory-zone:
spin_2Dconv  = spin_2Dstate.copy()

实际const卷积核为：

spin_2Dkernel =  np.array( [ [ 0, 1, 0 ],
                             [ 1, 0, 1 ],
                             [ 0, 1, 0 ]
                             ],
                           dtype = np.int8 # [PERF] to be field-tested,
                           )               #        some architectures may get faster if matching CPU-WORD

实际的 CPU 架构可能会受益于智能对齐的数据类型，但对于更大的二维域 ~ [ > 200, > 200 ] 用户会发现，由于无用的 memory-I/O 花费，成本会增加在主要二进制 { -1 | +1 } 或什至更紧凑的位图存储的 8-B-rich 传输上 - { 0 | 1 } 自旋信息。

接下来，不是对每个 field[:,:]-cell 进行双循环调用，而是 block-更新完整的 2D -域一步，帮助者得到：

#                             T[:,:] * sum(?)
spin_2Dconv[:,:] = spin_2Dstate[:,:] * signal.convolve2d( spin_2Dstate,
                                                          spin_kernel,
                                                          boundary = 'wrap',
                                                          mode     = 'same'
                                                          )[:,:]

由于自旋内核属性中的物理特性，
这个辅助数组将仅包含 { -4 | -2 | 0 | +2 | +4 } 个值。

简化、快速的矢量代码：

 def aVectorisedSpinUpdateSTEPrandom( S           =  spin_2Dstate,
                                      C           =  spin_2Dconv,
                                      K           =  spin_2Dkernel,
                                      minus2betaJ = -2 * beta * J
                                      ):
        C[:,:] = S[:,:] * signal.convolve2d( S, K, boundary = 'wrap', mode = 'same' )[:,:]
        S[:,:] = S[:,:] * np.where( np.exp( C[:,:] * minus2betaJ ) > np.random.rand(), -1, 1 )

对于物理学无法识别在整个 2D 域中以相同值发生自旋翻转的统一概率的情况，替换从 np.random.rand()[ 产生的标量=151=] 具有从 np.random.rand( N, M )[:,:] 传递的（个性化 ^† ）概率的二维场，现在这将增加一些成本，达到一些 7k3 ~ 9k3 [us] 每个旋转更新步骤：

def aVectorisedSpinUpdateSTEPrand2D( S = spin_2Dstate, C = spin_2Dconv, K = spin_2Dkernel, minus2betaJ = -2 * beta * J ): C[:,:] = S[:,:] * signal.convolve2d( S, K, boundary = 'wrap', mode = 'same' )[:,:] S[:,:] = S[:,:] * np.where( np.exp( C[:,:] * minus2betaJ ) > np.random.rand( N, M ), -1, 1 ) >>> aClk.start(); aVectorisedSpinUpdateSTEPrand2D( spin_2Dstate, spin_2Dconv, spin_2Dkernel, -0.8 );aClk.stop() 7280 [us] 8984 [us] 9299 [us]

_{宽屏评论原样来源：}

// ###################################################################### Cython PARALLEL prange / GIL-lock issues related to randomness-generator state-space management if PRNG-s are "immersed"-inside the cpython realms # https://www.desmos.com/calculator/bgz9t3s3nm @cython.boundscheck( False ) # https://www.desmos.com/calculator/ttz3r735qy @cython.wraparound( False ) # def cy_ising_step( np.int64_t[:, :] field, # field[N,M] of INTs (spin) { +1 | -1 } so why int64_t [SPACE] 8-Bytes for a principal binary ? Or a complex128 for Quantum-state A*|1> + B*|0> ? float beta # beta: a float-factor ): # cdef int N = field.shape[0] # const cdef int M = field.shape[1] # const cdef int offset = np.random.randint( 0, 2 ) #_GIL-lock # const ??? NEVER RE-USED BUT IN THE NEXT const SETUP .... in pre-load const-s from external scope ??? an inital RANDOM-flip-MODE-choice-{0|1} cdef np.int64_t[:,] n_update = np.arange( offset, N, 2, dtype = np.int64 ) # const ??? 8-B far small int-s ?? ~ field[N,M] .......... being { either | or } == [ {0|1}, {2|3}, ... , { N-2 | N-1 } ] of { (S) | [L] } cdef int m, n, i, j # idxs{ (E) | [O] } # # for m in prange( M, nogil = True ): # [PAR]||||||||||||||||||||||||||||| m in M ||||||||| i = m % 2 # ||||||||||||||||||||||||| i = m % 2 ||||||||| ... { EVEN | ODD }-nodes for j in range( n_update.shape[0] ) : # [SEQ] j over ... ||||||||| ... over const ( N / 2 )-steps ~ [0,1,2,...,N/2-1] as idx2access n_update with {(S)|[L]}-indices # n = n_update[j] # n = n_update[j] ||||||||| # cy_spin_flip( field, ( n + i ) % N, m % M, beta ) # ||||||||| # ||||| # INCONGRUENT with PHYSICAL FIELD ISING ||||||||| # vvvvv # self-rewriting field[n_,m]"during" current generation of [PAR][SEQ]-organised coverage of 2D-field[:,:] pass; cy_spin_flip( field, ( n_update[j] + i ) % N, m % M, beta ) # modifies field[i,j] ??? WHY MODULO-FUSED ( _n + {0|1} ) % N, _m % M ops when ALL ( _n + {0|1} ) & _m ARE ALWAYS < N, M ???? i.e. remain self ? # # ||||||||| return np.array( field, dtype = np.int64 ) # ||||||||| RET? #||| cy_spin_flip( ) [PAR]|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| [PERF]: all complete call-overheads are paid M*N/2 times (just to do a case-switching) cdef cy_spin_flip( np.int64_t[:, :] field, # field[N,M] of ints (spin) { +1 | -1 } why int64_t 8-Bytes for a principal binary ? Or a complex128 for Quantum-state A*|1> + B*|0> ? int n, # const int int m, # const int float beta = 0.4, # const float ? is a pure positive scalar or can also be negative ? float J = 1.0 # const float ? is a pure positive scalar or can also be negative ? caller keeps this on an implicit, const == 1 value ): cdef int N = field.shape[0] # const int ? [PERF]: Why let this test & assignment ever happen to happen as-many-as-N*M-times - awfully expensive, once principally avoidable... cdef int M = field.shape[1] # const int ? [PERF]: Why let this test & assignment ever happen to happen as-many-as-N*M-times - awfully expensive, once principally avoidable... cdef float dE = ( 2 * J * field[ n, m ] # const float [?] [PERF]: FMUL 2, J to happen as-many-as-N*M-times - awfully expensive, once principally avoidable... *( field[( n - 1 ) % N, m ] # | (const) vvvv------------aSureSpinFLIP + field[( n + 1 ) % N, m ] # [?]-T[n,m]-[?] sum(?) *T *( 2*J ) the spin-game ~{ -1 | +1 } * sum( ? ) |::::| + field[ n, ( m - 1 ) % M] # | := {-8J |-4J | 0 | 4J | 8J } + field[ n, ( m + 1 ) % M] # [?] a T-dependent choice|__if_+T__| |__if_-T__| FLIP @random-scaled by 2*J*beta )# | | # ( % MODULO-fused OPs "skew" physics - as it "rolls-over" a 2D-field TOPOLOGY ) ) # | | # if dE <= 0 : # | | # field[ n, m ] *= -1 # [PERF]: "inverts" spin (EXPENSIVE FMUL instead of bitwise +1 or numpy-efficient block-wise XOR MASK) (2D-requires more efforts for best cache-eff'cy) elif ( np.exp( -dE * beta ) # | | # [PERF]: with a minusBETA, one MUL uop SAVED * M * N > np.random.rand() #__________|_____________|__________GIL-lock# [PERF]: pre-calc in the external-scope + [PHYSICS]: Does the "hidden"-SEQ-order here anyhow matter in realms of generally accepted laws of PHYSICS??? ): # | | # Is a warranty of the uniform distribution "lost" by an if(field-STATE)-governed sub-stepping ???? field[ n, m ] *= -1 # identical OP ? .OR.-ed in if(): ? of a pre-generated uniform-.rand() or a general (non-sub-stepped) sequenced stepping ???? # # in a stream-of-PRNG'd SPIN-FLIP threshold floats from a warranted uniform distrib. of values ????

物理学：

beta 控制（给定 const J）{ -8 | -4 | 0 | +4 | +8 } 的自旋翻转阈值模型，这是 ~ 2 * spin_2Dkernel-当前 spin_2Dstate 的整个 2D 域的卷积可在此处获得：https://www.desmos.com/calculator/bgz9t3s3nm 可以使用 beta 进行现场实验以查看任一可能的降低阈值正输出 { + 4 | + 8 }，因为 np.exp( -dE * 2 * J * beta ) 受到 beta 的强烈控制，并且 beta 越大，随机抽取数字的概率越低，保证来自半封闭范围[0, 1) 不会控制 np.exp()-结果。

† Post-Festum 的结语备注：

_{"Normally on a true Metropolis algorithm, you flip spins (chosen randomly) one by one. As I wanted to parallelize the algorithm I flip half the spins for each iteration (when the function cy_ising_step is called). Those spins are chosen in a way that none of thems are nearest neighbor as it would impact the Monte-Carlo optimization. This might not be a correct approach..."
– Angelo C}

感谢您对方法和选择的所有评论和详细信息。通过一对非“介入" 格子需要更谨慎地选择策略来获取随机性。

虽然使用“最具侵略性”的更新密度，但随机性来源是核心问题 - 不仅是整体处理性能（如何维护 FSA 状态本身的技术问题，如果求助于一个天真的中央 PRNG 源）。

你要么将你的过程设计为真正基于随机性（使用一些可用的非确定性熵来源），要么愿意服从于允许可重复实验的策略（用于重新检查）和科学计算的重新验证），为此您还有一项职责 - 此类科学实验的配置管理职责（记录/设置/分发/管理所有 PRNG-s 的初始“播种”，即科学计算实验配置为使用。

这里，鉴于自然保证自旋在 field[:,:] 的二维域中相互独立，时间箭头的方向应该是唯一的方向，其中（确定性的）- PRNG-s 可能会保留其对在 [0,1) 上保持均匀分布的输出的保证。作为其副作用，它们不会对它们各自内部状态的个体进化的并行化造成任何问题。答对了！计算成本低、HPC 级性能和稳健随机的 PRNG-s 是执行此操作的安全方法（请注意，如果还没有意识到，并非所有“COTS”PRNG-s 都“内置”所有这些属性）。

这意味着，当且仅当它从其“自己的”中获得自旋翻转决策阈值时，任何一个自旋都将保持公平且符合物理定律（因此一致自主地保持分布的均匀性输出）PRNG 实例（不是问题，但需要注意不要忘记它正确实施并且运行它有效）。

对于需要操作一个真正不确定的 PRNG 的情况，如果试图在其性能上限之外使用它，真正的 ND 熵源可能会成为性能瓶颈。在 （无论多大，但仍然） 有限状态自动机领域中，争取类自然熵是一项具有挑战性的任务，不是吗？

如何在cython中使用prange？

How to use prange in cython?

python

parallel-processing

physics

montecarlo

cython

使用二维核卷积的矢量化代码：

物理学：

† Post-Festum 的结语备注：

如何在cython中使用prange？

How to use prange in cython?

python

parallel-processing

physics

montecarlo

cython

使用二维核卷积的矢量化代码：

物理学：

† Post-Festum 的结语 备注：

† Post-Festum 的结语备注：