BLAS v. parallel updates for Julia SharedArray objects
I am interested in using Julia SharedArray objects for a scientific computing project. My current implementation uses BLAS for all matrix-vector operations, but I thought that a SharedArray might provide some speedup on multicore machines. My idea is simply to update an output vector index-by-index, farming the index updates out to worker processes. Previous discussions about shared-memory objects here about SharedArrays and here did not provide definitive guidance on this issue. It looks intuitively simple, but after testing I am somewhat confused about why this approach performs so poorly (see the code below). For starters, @parallel for seems to allocate a lot of memory. If I prefix the loop with @sync, which seems like a smart thing to do when the entire output vector is needed later, then the parallel loop is substantially slower (though without @sync the loop is mighty fast).
Have I misinterpreted the proper use of SharedArray objects? Or have I assigned the calculations inefficiently?
### test for speed gain w/ SharedArray vs. Array ###
# problem dimensions
n = 10000; p = 25000
# set BLAS threads; 64 seems reasonable in testing
blas_set_num_threads(64)
# make normal Arrays
x = randn(n,p)
y = ones(p)
z = zeros(n)
# make SharedArrays
X = convert(SharedArray{Float64,2}, x)
Y = convert(SharedArray{Float64,1}, y)
Z = convert(SharedArray{Float64,1}, z)
# run BLAS.gemv! on Arrays twice, time second case
BLAS.gemv!('N', 1.0, x, y, 0.0, z)
@time BLAS.gemv!('N', 1.0, x, y, 0.0, z)
# does BLAS work equally well for SharedArrays?
# check timing result and ensure same answer
BLAS.gemv!('N', 1.0, X, Y, 0.0, Z)
@time BLAS.gemv!('N', 1.0, X, Y, 0.0, Z)
println("$(isequal(z,Z))") # should be true
# SharedArrays can be updated in parallel
# code a loop to farm updates to worker nodes
# use transposed X to place rows of X in columnar format
# should (hopefully) help with performance issues from stride
Xt = X'
@parallel for i = 1:n
    Z[i] = dot(Y, Xt[:,i])
end
# now time the synchronized copy of this
@time @sync @parallel for i = 1:n
    Z[i] = dot(Y, Xt[:,i])
end
# still get same result?
println("$(isequal(z,Z))") # should be true
Test output with 4 workers + 1 master node:
elapsed time: 0.109010169 seconds (80 bytes allocated)
elapsed time: 0.110858551 seconds (80 bytes allocated)
true
elapsed time: 1.726231048 seconds (119936 bytes allocated)
true
You are running into several issues, the most important of which is that Xt[:,i] creates a new array (allocating memory). Here is a demonstration that gets you closer to what you want:
n = 10000; p = 25000
# make normal Arrays
x = randn(n,p)
y = ones(p)
z = zeros(n)
# make SharedArrays
X = convert(SharedArray, x)
Y = convert(SharedArray, y)
Z = convert(SharedArray, z)
Xt = X'
# dot product of a with column j of B, written so that no temporary column slice is allocated
@everywhere function dotcol(a, B, j)
    length(a) == size(B,1) || throw(DimensionMismatch("a and B must have the same number of rows"))
    s = 0.0
    @inbounds @simd for i = 1:length(a)
        s += a[i]*B[i,j]
    end
    s
end
# serial version
function run1!(Z, Y, Xt)
    for j = 1:size(Xt, 2)
        Z[j] = dotcol(Y, Xt, j)
    end
    Z
end
# parallel version: farm the column dot products out to the workers
function runp!(Z, Y, Xt)
    @sync @parallel for j = 1:size(Xt, 2)
        Z[j] = dotcol(Y, Xt, j)
    end
    Z
end
# warm up (compile) both versions, then time them
run1!(Z, Y, Xt)
runp!(Z, Y, Xt)
@time run1!(Z, Y, Xt)
zc = copy(sdata(Z))   # keep the serial result for comparison
fill!(Z, -1)          # clobber Z so the equality check below is meaningful
@time runp!(Z, Y, Xt)
@show sdata(Z) == zc  # parallel result should match the serial one
Results (when started with julia -p 8):
julia> include("/tmp/paralleldot.jl")
elapsed time: 0.465755791 seconds (80 bytes allocated)
elapsed time: 0.076751406 seconds (282 kB allocated)
sdata(Z) == zc = true
For comparison, when run on the same machine:
julia> blas_set_num_threads(8)
julia> @time A_mul_B!(Z, X, Y);
elapsed time: 0.067611858 seconds (80 bytes allocated)
So this plain Julia implementation is at least competitive with BLAS.
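As a footnote: on recent Julia versions (1.x), @parallel, blas_set_num_threads, and A_mul_B! no longer exist (they became Distributed.@distributed, BLAS.set_num_threads, and mul!), and the same allocation issue can be avoided with views instead of a hand-written dotcol. A rough thread-based sketch of the same idea might look like the following (runthreads! is just an illustrative name, and this assumes Julia was started with several threads, e.g. julia -t 8):
# rough Julia 1.x sketch: threads + views instead of @parallel + dotcol
using LinearAlgebra
function runthreads!(Z, Y, Xt)
    Threads.@threads for j in 1:size(Xt, 2)
        # the view avoids the temporary array that Xt[:, j] would otherwise allocate
        @views Z[j] = dot(Y, Xt[:, j])
    end
    return Z
end
# usage sketch: Xt = permutedims(x); runthreads!(z, y, Xt)
This tends to behave like the runp! version above, without the inter-process communication overhead that @parallel incurs.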