Getting Julia SharedArrays to play nicely with Sun Grid Engine
I have been trying to get Julia programs that use SharedArrays to run correctly in an SGE environment. I have read several threads on Julia and SGE, but most of them seem to deal with MPI. The function bind_pe_procs() from this Gist appears to bind the processes to the local environment correctly. A script like
### define bind_pe_procs() as in Gist
### ...
println("Started julia")
bind_pe_procs()
println("do SharedArrays initialize correctly?")
x = SharedArray(Float64, 3, pids = procs(), init = S -> S[localindexes(S)] = 1.0)
pids = procs(x)
println("number of workers: ", length(procs()))
println("SharedArrays map to ", length(pids), " workers")
produces the following output:
starting qsub script file
Mon Oct 12 15:13:38 PDT 2015
calling mpirun now
exception on 2: exception on exception on 4: exception on exception on 53: : exception on exception on exception on Started julia
parsing PE_HOSTFILE
[{"name"=>"compute-0-8.local","n"=>"5"}]compute-0-8.local
ASCIIString["compute-0-8.local","compute-0-8.local","compute-0-8.local","compute-0-8.local"]adding machines to current system
done
do SharedArrays initialize correctly?
number of workers: 5
SharedArrays map to 5 workers
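For reference, here is a minimal sketch of what bind_pe_procs() presumably does, judging from the "parsing PE_HOSTFILE" and "adding machines to current system" lines in the output above: read the SGE host file and add one worker per allocated slot. This is an assumption about the Gist rather than its actual code, and it assumes the standard PE_HOSTFILE layout.
# Hypothetical sketch of bind_pe_procs(); the real version lives in the Gist.
function bind_pe_procs()
    hostfile = get(ENV, "PE_HOSTFILE", "")
    isempty(hostfile) && return                 # not running under an SGE parallel environment
    println("parsing PE_HOSTFILE")
    machines = ASCIIString[]
    for line in open(readlines, hostfile)
        # typical line: "compute-0-8.local 5 all.q@compute-0-8.local UNDEFINED"
        fields = split(line)
        host   = ascii(fields[1])
        nslots = parse(Int, fields[2])
        # one slot is already taken by the master process
        append!(machines, fill(host, nslots - 1))
    end
    println("adding machines to current system")
    addprocs(machines)                          # start one worker per remaining slot
    println("done")
end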
Strangely, this does not seem to work if I need to load an array from a file and convert it to SharedArray format with the command convert(SharedArray, vec(readdlm(FILEPATH))). If the script is
println("Started julia")
bind_pe_procs()
### script reads arrays from file and converts to SharedArrays
println("running script...")
my_script()
then the result is garbage:
starting qsub script file
Mon Oct 19 09:18:29 PDT 2015
calling mpirun now Started julia
parsing PE_HOSTFILE
[{"name"=>"compute-0-5.local","n"=>"11"}]compute-0-5.local
ASCIIString["compute-0-5.local","compute-0-5.local","compute-0-5.local","compute-0-5.local","compute-0-5.local","compute-0-5.local","compute-0-5.local","compute-0-5.local","compute-0-5.local","compute-0- 5.local"]adding machines to current system
done
running script...
Current number of processes: [1,2,3,4,5,6,7,8,9,10,11]
SharedArray y is seen by [1] processes
### tons of errors here
### important one is "SharedArray cannot be used on a non-participating process"
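For concreteness, here is a hypothetical reconstruction of the relevant part of my_script() (the function body and the file-path argument are placeholders based on the description above, not the real code):
# Hypothetical sketch of the failing pattern; FILEPATH is the placeholder from the question.
function my_script(filepath::AbstractString)
    println("Current number of processes: ", procs())
    # readdlm reads the whole file on process 1; convert then wraps the vector
    # in a SharedArray, which in this run ends up being seen by process 1 only.
    y = convert(SharedArray, vec(readdlm(filepath)))
    println("SharedArray y is seen by ", procs(y), " processes")
    # any worker that is not in procs(y) then throws
    # "SharedArray cannot be used on a non-participating process"
end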
So the SharedArrays do not map correctly onto all of the cores. Does anybody have suggestions or insight into this problem?
One workaround I use in my own work is simply to force SGE to submit the job to a particular node and then restrict the parallel environment to the number of cores I want to use. Below is an SGE qsub script for a 24-core node of which I only want to use 6 cores.
#!/bin/bash
# lots of available SGE script options, only relevant ones included below
# request processes in parallel environment
#$ -pe orte 6
# use this command to dump job on a particular queue/node
#$ -q all.q@compute-0-13
/share/apps/julia-0.4.0/bin/julia -p 5 MY_SCRIPT.jl
Pro: plays nicely with SharedArrays.
Con: the job will wait in the queue until enough cores are free on that node.
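For what it's worth, julia -p 5 adds 5 workers, which together with the master process fills the 6 slots requested by -pe orte 6. A minimal sanity check under this workaround might look as follows (the input file name is hypothetical):
# All 6 processes live on one node, so a SharedArray built with the default
# pids should be visible to every worker.
println("number of processes: ", nprocs())                  # expect 6
y = convert(SharedArray, vec(readdlm("my_data.txt")))        # hypothetical input file
println("SharedArray y is seen by ", length(procs(y)), " processes")
@sync @parallel for i in 1:length(y)                         # workers can read and write y
    y[i] += 1.0
end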