Julia

Question

我有一个维度 (n, m) 的矩阵 A 和一个维度 (n, p) 的矩阵 B。对于 n 行中的每一行，我想计算 A 行和 B 行之间的外积，它们是 (m, p) 矩阵。然后我有一个大小为 n 的向量 x，我想将这些矩阵中的每一个乘以 x 的相应条目并将所有内容相加。我该怎么做？

# Parameters
n, m, p = 100, 10, 3
# Matrices & Vectors
A, B, x = randn(n, m), randn(n, p), randn(n)
# Slow method
result = zeros(m, p)
for i in 1:n
    result += x[i] * (A[i, :] * B[i, :]')
end

Answer 1

这是另一个版本

# Parameters
n, m, p = 100, 10, 3

# Matrices & Vectors
A, B, x = randn(n, m), randn(n, p), randn(n)

function old_way(A, B, x)
    # Slow method
    result = zeros(m, p)
    for i in 1:n
        result += x[i] * (A[i, :] * B[i, :]')
    end
end

function another_way(A, B, x)
    sum(xi * (Arow * Brow') for (xi, Arow, Brow) in zip(x, eachrow(A), eachrow(B)))
end

和基准测试：

julia> using BenchmarkTools

julia> @benchmark old_way(A, B, x)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):   83.495 μs …   2.653 ms  ┊ GC (min … max): 0.00% … 95.79%
 Time  (median):      87.500 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   101.496 μs ± 115.196 μs  ┊ GC (mean ± σ):  6.46% ±  5.53%

  ▃█▇▆▂    ▁ ▁        ▁                                         ▁
  ███████████████▇█▇██████▆▇▇▇▆▇▇▇▇▇▇▇▇▆▇▆▆▇▆▇▅▆▆▆▄▅▅▅▅▆▅▆▅▅▅▄▅ █
  83.5 μs       Histogram: log(frequency) by time        200 μs <

 Memory estimate: 153.48 KiB, allocs estimate: 1802.


julia> @benchmark another_way(A, B, x)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  25.850 μs … 923.032 μs  ┊ GC (min … max): 0.00% … 95.94%
 Time  (median):     27.477 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   31.851 μs ±  35.440 μs  ┊ GC (mean ± σ):  5.03% ±  4.49%

  ▇█▇▅▃▁           ▁▁▁                                         ▁
  ███████▇▇▆▇████████████▇█▇▇▇▆▇▆▅▆▆▆▅▆▆▅▆▆▇▆▅▅▅▆▅▆▅▅▅▄▃▅▅▅▄▄▄ █
  25.8 μs       Histogram: log(frequency) by time      77.4 μs <

 Memory estimate: 98.31 KiB, allocs estimate: 304.

所以速度快了一点，占用的内存也少了一点。

这是否回答了您的问题？

Answer 2

节省时间和内存的一般提示：

将代码放在方法中而不是全局范围内，并确保该函数中的每个变量都来自参数，而不是全局变量。这样，Julia 的编译器就可以推断变量的类型并进行优化。
尽可能减少分配，您就有很多机会。此处的更改区分了 old_way 和 new_way 方法，它导致 5-6 倍的加速和减少到 1 个分配。
- 切片数组时，使用 @view 以避免分配副本的默认行为。
- 您可以使用 .+= 就地更改 result。 += 分配一个新数组并将变量 result 重新分配给它。
- 对于像 x[i] * ... 这样的逐元素运算，链接点运算符融合了底层的逐元素循环并减少了中间数组的分配。
- 列 (Mx1) 向量和行 (1xN) 向量的矩阵乘法可以简化为逐元素乘法。

n, m, p = 100, 10, 3
A, B, x = randn(n, m), randn(n, p), randn(n)

# Methods below do not use the above global variables

function old_way(A, B, x, n, m, p)
  result = zeros(m, p)
  for i in 1:n
    result += x[i] * (A[i, :] * B[i, :]')
  end
  result
end

function new_way(A, B, x, n, m, p)
  result = zeros(m, p)
  for i in 1:n
    result .+= x[i] .* ( @view(A[i, :]) .* @view(B[i, :])' )
  end
  result
end

using BenchmarkTools

@btime old_way(A, B, x, n, m, p);
# 36.753 μs (501 allocations: 125.33 KiB)

@btime new_way(A, B, x, n, m, p);
# 6.542 μs (1 allocation: 336 bytes)

old_way(A, B, x, n, m, p) == new_way(A, B, x, n, m, p)
# true

到目前为止，上面的示例避免了全局变量，下面的示例将说明原因。即使你把你的代码放在一个方法中但仍然使用全局变量，不仅性能通常更差，试图减少分配 适得其反:

# Methods below use n, m, p as global inputs

function old_oops(A, B, x)
  # same code as old_way(A, B, x, n, m, p)
end

function new_oops(A, B, x)
  # same code as new_way(A, B, x, n, m, p)
end

@btime old_oops(A, B, x);
# 95.317 μs (1802 allocations: 153.48 KiB)

@btime new_oops(A, B, x);
# 235.191 μs (1302 allocations: 81.61 KiB)

Answer 3

如果您的设置与 MWE 具有相同的结构，则使用 LinearAlgebra：

faster(A,B,x) = (diagm(x)*A)'*B

运行速度快 4 倍：

using LinearAlgebra, BenchmarkTools

# Parameters
n, m, p = 100, 10, 3
# Matrices & Vectors
A, B, x = randn(n, m), randn(n, p), randn(n)
# Slow method

function slow(A,B,x)
    result = zeros(m, p)
    for i in 1:n
        result += x[i] * (A[i, :] * B[i, :]')
    end
    result
end

faster(A,B,x) = (diagm(x)*A)'*B

@assert(slow(A,B,x) ≈ faster(A,B,x))
@btime slow($A,$B,$x)     # 113.400 μs (1802 allocations: 139.39 KiB)
@btime faster($A,$B,$x)   # 28.100 μs (4 allocations: 86.41 KiB)

Julia - 行外积的线性组合

Julia - Linear combination of row-wise outer products

linear-algebra