
CPU affinity issue using Python API for MOSEK

I am running into a CPU affinity issue with MOSEK while solving linear integer programs. My program parallelizes using the multiprocessing module in Python, so MOSEK runs concurrently in each process. The machine has 48 cores, so I run 48 concurrent processes using the Pool class. Their documentation states that the API is thread safe.

Below is the output of top shortly after starting the program. It shows that about 50% of the CPUs are idle. Only the first 20 lines of the top output are shown.

top - 22:04:42 up 5 days, 14:38,  3 users,  load average: 10.67, 13.65, 6.29
Tasks: 613 total,  47 running, 566 sleeping,   0 stopped,   0 zombie
%Cpu(s): 46.3 us,  3.8 sy,  0.0 ni, 49.2 id,  0.7 wa,  0.0 hi,  0.0 si,  0.0 st
GiB Mem:   503.863 total,  101.613 used,  402.250 free,    0.482 buffers
GiB Swap:   61.035 total,    0.000 used,   61.035 free.   96.250 cached Mem

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
115517 njmeyer   20   0  171752  27912  11632 R  98.7  0.0   0:02.52 python
115522 njmeyer   20   0  171088  27472  11632 R  98.7  0.0   0:02.79 python
115547 njmeyer   20   0  171140  27460  11568 R  98.7  0.0   0:01.82 python
115550 njmeyer   20   0  171784  27880  11568 R  98.7  0.0   0:01.64 python
115540 njmeyer   20   0  171136  27456  11568 R  92.5  0.0   0:01.91 python
115551 njmeyer   20   0  371636  31100  11632 R  92.5  0.0   0:02.93 python
115539 njmeyer   20   0  171132  27452  11568 R  80.2  0.0   0:01.97 python
115515 njmeyer   20   0  171748  27908  11632 R  74.0  0.0   0:03.02 python
115538 njmeyer   20   0  171128  27512  11632 R  74.0  0.0   0:02.51 python
115558 njmeyer   20   0  171144  27528  11632 R  74.0  0.0   0:02.28 python
115554 njmeyer   20   0  527980  28728  11632 R  67.8  0.0   0:02.15 python
115524 njmeyer   20   0  527956  28676  11632 R  61.7  0.0   0:02.34 python
115526 njmeyer   20   0  527956  28704  11632 R  61.7  0.0   0:02.80 python

I checked the MOSEK parameters section of the documentation, but I didn't see anything related to CPU affinity. They do have some flags related to multithreading in the optimizer. These are set to off by default, and redundantly setting them to off made no difference.

I checked the CPU affinity of the running python jobs, and many of them were pinned to the same CPU. Stranger still, I couldn't set the CPU affinity, or at least it appeared to be changed again shortly after I changed it.

I picked one of the jobs and set its CPU affinity by running taskset -p 0xFFFFFFFFFFFF 115526. I did this 10 times with a 1 second pause in between. Here is the CPU affinity mask after each taskset call.

pid 115526's current affinity mask: 10
pid 115526's new affinity mask: ffffffffffff
pid 115526's current affinity list: 7
pid 115526's current affinity mask: 800000000000
pid 115526's new affinity mask: ffffffffffff
pid 115526's current affinity list: 0-47
pid 115526's current affinity mask: 800000000000
pid 115526's new affinity mask: ffffffffffff
pid 115526's current affinity list: 0-47
pid 115526's current affinity mask: ffffffffffff
pid 115526's new affinity mask: ffffffffffff
pid 115526's current affinity list: 0-47
pid 115526's current affinity mask: ffffffffffff
pid 115526's new affinity mask: ffffffffffff
pid 115526's current affinity list: 0-47
pid 115526's current affinity mask: ffffffffffff
pid 115526's new affinity mask: ffffffffffff
pid 115526's current affinity list: 0-47
pid 115526's current affinity mask: 200000000000
pid 115526's new affinity mask: ffffffffffff
pid 115526's current affinity list: 47
pid 115526's current affinity mask: ffffffffffff
pid 115526's new affinity mask: ffffffffffff
pid 115526's current affinity list: 0-47
pid 115526's current affinity mask: 800000000000
pid 115526's new affinity mask: ffffffffffff
pid 115526's current affinity list: 0-47
pid 115526's current affinity mask: 800000000000
pid 115526's new affinity mask: ffffffffffff
pid 115526's current affinity list: 0-47
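As a sanity check on the masks above (a quick sketch, nothing MOSEK-specific): the 0xFFFFFFFFFFFF mask is simply 48 set bits, one per core, and the stray masks that keep reappearing decode to single high-numbered cores.

```python
ncpu = 48

# The all-cores mask is one set bit per core.
all_cores = (1 << ncpu) - 1
assert all_cores == 0xFFFFFFFFFFFF

def mask_to_cores(mask):
    # Decode an affinity mask into the core list taskset would print.
    return [i for i in range(ncpu) if mask & (1 << i)]

print(mask_to_cores(0x800000000000))  # [47]
print(mask_to_cores(0x200000000000))  # [45]
```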

It looks like something keeps changing the CPU affinity while the jobs are running.
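The same polling can be done from Python instead of repeated taskset calls, using os.sched_getaffinity (Linux, Python 3.3+). Here pid 0 means the calling process; a worker PID from the top output can be substituted.

```python
import os
import time

def affinity_snapshots(pid=0, samples=5, interval=1.0):
    """Record a process's allowed-CPU set over time (Linux only).

    pid=0 means the calling process; pass a worker PID such as
    115526 to watch one of the pool jobs instead.
    """
    snaps = []
    for _ in range(samples):
        snaps.append(sorted(os.sched_getaffinity(pid)))
        time.sleep(interval)
    return snaps

if __name__ == "__main__":
    for cpus in affinity_snapshots(samples=3, interval=0.2):
        print(cpus)
```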

I also tried setting the CPU affinity of the parent process, with the same effect.
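For completeness, pinning can also be attempted from inside the pool itself with an initializer, so each worker locks itself to one core before the solver starts. This is only a sketch, not MOSEK-specific: current_process()._identity is a CPython implementation detail (a 1-based worker index), and pinning complements rather than replaces a solver-side thread limit.

```python
import multiprocessing
import os

def pin_worker():
    # Runs once in each worker: pin this process to a single core.
    # _identity[0] is the 1-based worker index in CPython's Pool.
    worker_id = multiprocessing.current_process()._identity[0] - 1
    os.sched_setaffinity(0, {worker_id % os.cpu_count()})

def allowed_cpus(_):
    # Report how many CPUs this worker may run on.
    return len(os.sched_getaffinity(0))

if __name__ == "__main__":
    pool = multiprocessing.Pool(initializer=pin_worker)
    print(pool.map(allowed_cpus, range(8)))  # a list of 1s
    pool.close()
    pool.join()
```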

Here is the code I am running.

import mosek
import sys
import cPickle as pickle
import multiprocessing
import time

def mosekOptim(aCols,aVals,b,c,nCon,nVar,numTrt):
    """Solve the linear integer program.


    Solve the program
    max c' x
    s.t. Ax <= b

    """

    ## setup mosek
    with mosek.Env() as env, env.Task() as task:
        task.appendcons(nCon)
        task.appendvars(nVar)
        inf = float("inf")


        ## c
        for j,cj in enumerate(c):
            task.putcj(j,cj)


        ## bounds on A
        bkc = [mosek.boundkey.fx] + [mosek.boundkey.up
                                     for i in range(nCon-1)]

        blc = [float(numTrt)] + [-inf for i in range(nCon-1)]
        buc = b


        ## bounds on x
        bkx = [mosek.boundkey.ra for i in range(nVar)]
        blx = [0.0]*nVar
        bux = [1.0]*nVar

        for j,a in enumerate(zip(aCols,aVals)):
            task.putarow(j,a[0],a[1])

        for j,bc in enumerate(zip(bkc,blc,buc)):
            task.putconbound(j,bc[0],bc[1],bc[2])

        for j,bx in enumerate(zip(bkx,blx,bux)):
            task.putvarbound(j,bx[0],bx[1],bx[2])

        task.putobjsense(mosek.objsense.maximize)

        ## integer type
        task.putvartypelist(range(nVar),
                            [mosek.variabletype.type_int
                             for i in range(nVar)])

        task.optimize()

        task.solutionsummary(mosek.streamtype.msg)

        prosta = task.getprosta(mosek.soltype.itg)
        solsta = task.getsolsta(mosek.soltype.itg)

        xx = mosek.array.zeros(nVar,float)
        task.getxx(mosek.soltype.itg,xx)

    if solsta not in [mosek.solsta.integer_optimal,
                      mosek.solsta.near_integer_optimal]:
        raise ValueError("Non optimal or infeasible: %s" % solsta)
    else:
        return xx


def reps(secs,*args):
    start = time.time()
    while time.time() - start < secs:
        for i in range(100):
            mosekOptim(*args)


def main():
    with open("data.txt","r") as f:
        data = pickle.loads(f.read())

    args = (60,) + data

    pool = multiprocessing.Pool()
    jobs = []
    for i in range(multiprocessing.cpu_count()):
        jobs.append(pool.apply_async(reps,args=args))
    pool.close()
    pool.join()

if __name__ == "__main__":
    main()

The code unpickles data I computed beforehand. The objects are the constraints and coefficients for the linear program. I host the code and this data file in this repository.

Has anyone else experienced this behavior with MOSEK? Any suggestions on how to proceed?

I contacted support, and they suggested setting MSK_IPAR_NUM_THREADS to 1. My problem only takes a fraction of a second to solve, so it never looked like it was using multiple cores. I should have checked the documentation for the default values.

In my code, I added task.putintparam(mosek.iparam.num_threads, 1) right after the with statement. This fixed the problem.