最小化涉及大加权矩阵的平方和的最快方法
Fastest way to minimize sum of squares involving large weighted matrices
我需要最小化两个大(10000 x 40000)矩阵之间的平方和:Σ(X-A)^2 其中
X 是两个矩阵(10000 x 20000)的串联,每个段都单独加权(W),内部函数参见图片。。
还有一个约束,权重之和必须等于1 (W1 + W2 = 1)。我目前正在 scipy 优化中使用 SLSQP 方法来获得最佳权重值,但在我的计算机上处理大约需要 10 分钟。有没有更好的方法可以做到这一点而不需要这么长时间?我还在下面附上了我正在使用的代码。
from scipy.optimize import minimize
import numpy
def objective(W,X1,X2,A):
W1=W[0]
W2=W[1]
X=numpy.hstack((W1*X1,W2*X2))
return numpy.sum((X-A)**2)
def constraint1(W):
W1=W[0]
W2=W[1]
return W1+W2-1
x0=[[1,0]]
cons = {'type': 'eq', 'fun':constraint1}
#Random data only used for purposes of example
segment_1 = numpy.random.rand(10000, 20000)
segment_2 = numpy.random.rand(10000, 20000)
A = numpy.random.rand(10000, 40000)
sol=minimize(objective,x0[0],args=(segment_1,segment_2,A),method='SLSQP',constraints=cons)
简介
好吧...有大量潜在改进,我只是触及阻力最小的路径(由于懒惰,因为信息稀疏)。
如果您非常关心性能,请考虑如何快速计算梯度(符号/代数;没有隐藏的数值微分,就像在基于 SLSQP 的代码中所做的那样)。
提示: 比较 SLSQP 的 nfev
和 nit
以查看数值微分的开销!
方法
- 您只对优化两个变量的凸组合感兴趣
- 这很容易映射到标量优化
- 优化
W1
(有界)并且隐含 W2
!
代码(未优化的 hack)
原装配件
from scipy.optimize import minimize
import numpy
numpy.random.seed(0)
from time import perf_counter as pc
"""
ORIGNAL CODE
"""
def objective(W,X1,X2,A):
W1=W[0]
W2=W[1]
X=numpy.hstack((W1*X1,W2*X2))
return numpy.sum((X-A)**2)
def constraint1(W):
W1=W[0]
W2=W[1]
return W1+W2-1
x0=[[1,0]]
cons = {'type': 'eq', 'fun':constraint1}
#Random data only used for purposes of example
segment_1 = numpy.random.rand(1000, 20000)
segment_2 = numpy.random.rand(1000, 20000)
A = numpy.random.rand(1000, 40000)
# MODIFICATION: make instance more interesting! != ~0.5/0.5 solution
segment_1 *= 1.3
start_time = pc()
sol = minimize(objective,x0[0],args=(segment_1,segment_2,A),method='SLSQP',constraints=cons)
end_time = pc()
print(sol)
print('secs: {}'.format(end_time - start_time))
修改部分(同上source/file)
"""
ALTERNATIVE CODE hacked around ORIGINAL CODE
"""
from scipy.optimize import minimize_scalar
def solve_alternative(ojective_func, segment_1, segment_2, A):
objective_func_wrapper = lambda x: ojective_func(
(x, 1.0-x), segment_1, segment_2, A)
x = minimize_scalar(objective_func_wrapper, method='Bounded', bounds=(0, 1))
return x
start_time = pc()
sol = solve_alternative(objective, segment_1, segment_2, A)
end_time = pc()
print(sol)
print('secs: {}'.format(end_time - start_time))
示例输出(在类似平板电脑的设备上尺寸缩小到 运行)
fun: 6280656.612055224
jac: array([-2736633.5 , -2736633.375])
message: 'Optimization terminated successfully.'
nfev: 19
nit: 3
njev: 3
status: 0
success: True
x: array([0.45541614, 0.54458386])
secs: 32.6327816
fun: 6280656.612055217
message: 'Solution found.'
nfev: 6
status: 0
success: True
x: 0.45541616584551015
secs: 9.930487200000002
虽然我利用了另一个答案中只有两个变量,但这里我们专注于进行有效的函数评估!
利用 objective 固有的简单性允许重用原始的基于 SLSQP 的优化,同时准备好额外的 段/变量 将来(如评论中所示)只要结构保持不变。
优化成本应该大约等于单个函数评估的成本!
想法
- 由于潜在的内存分配和数组副本,原始函数尽可能低效(例如
np.stack()
)
- 可以通过没有内存操作的存储顺序并行评估来改进(参见代码)
- 由于涉及大量数据,通常评估成本很高(在某些优化器的每次迭代中)
- 可以重新制定!
- 尽可能多地进行先验计算,以加快计算速度,具体取决于
仅限优化变量!
改写基本上是遵循 WolframAlpha:
备注:
- 任务以矩阵形式呈现,但它实际上只是一个扁平化/逐元素计算
- 这在向 WolframAlpha 询问 ideas 之前被直接利用:见输入!
- 比较:
W1 = x
W2 = y
X1 = v_i
(但 1d)
X2 = w_i
(但 1d)
A = a_i
和b_i
(分解+1d)
代码
import numpy as np
from scipy.optimize import minimize
from time import perf_counter as pc
np.random.seed(0)
# random fake-data
# ################
segment_1 = np.random.rand(5000, 10000) * 7.13
segment_2 = np.random.rand(5000, 10000) * np.random.normal(size=(5000, 10000))
A = np.random.rand(5000, 20000)
# constraint: probability-simplex
# ###############################
def constraint(x):
return np.sum(x) - 1.
# objective
# #########
# original -> very inefficient due to lots of potential memcopy
# -------------------------------------------------------------
def objective(x):
W1=x[0]
W2=x[1]
X=np.hstack((W1*segment_1, W2*segment_2))
return np.sum((X-A)**2)
# modified -> (hopefully) no memory-allocation at all; (hopefully) storage-order parallel iteration
# -------------------------------------------------------------------------------------------------
def objective_perf(x):
return np.sum( ((x[0] * segment_1) - A[:, :segment_1.shape[1]])**2 ) \
+ np.sum( ((x[1] * segment_2) - A[:, segment_1.shape[1]:])**2 )
# heavily reformulated
# ####################
start_time = pc()
# pre-calc: flatten out matrices as we are doing element-wise reasoning anyway
flat_A_segment_A = A[:, :segment_1.shape[1]].ravel()
flat_A_segment_B = A[:, segment_1.shape[1]:].ravel()
flat_segment_A = segment_1.ravel()
flat_segment_B = segment_2.ravel()
# pre-calc: sub-expressions (see WolframAlpha image!) / sum_squares(vec) = np.dot(vec, vec)
comp_0 = np.dot(flat_A_segment_A, flat_A_segment_A) + np.dot(flat_A_segment_B, flat_A_segment_B)
comp_1 = -2 * np.dot(flat_A_segment_A, flat_segment_A)
comp_2 = -2 * np.dot(flat_A_segment_B, flat_segment_B)
comp_3 = np.dot(flat_segment_A, flat_segment_A)
comp_4 = np.dot(flat_segment_B, flat_segment_B)
end_time = pc()
print('pre-calc secs: {}\n'.format(end_time - start_time))
# pre-calc based function-eval / basically *for free*
def objective_high_perf(x):
return comp_0 + x[0] * comp_1 + x[1] * comp_2 + x[0]**2 * comp_3 + x[1]**2 * comp_4
# SLSQP-based solving
# -------------------
cons = {'type': 'eq', 'fun': constraint}
x0 = [1.0, 0.0]
print('-----\nVariant: "objective"\n-----')
start_time = pc()
sol = minimize(objective_perf, x0, method='SLSQP', constraints=cons)
end_time = pc()
print(sol)
print('secs: {}\n'.format(end_time - start_time))
print('-----\nVariant: "objective_perf"\n-----')
start_time = pc()
sol = minimize(objective_perf, x0, method='SLSQP', constraints=cons)
end_time = pc()
print(sol)
print('secs: {}\n'.format(end_time - start_time))
print('-----\nVariant: "objective_high_perf"\n-----')
start_time = pc()
sol = minimize(objective_high_perf, x0, method='SLSQP', constraints=cons)
end_time = pc()
print(sol)
print('secs: {}\n'.format(end_time - start_time))
输出(由于类似平板设备的内存,不是完整数据)
pre-calc secs: 1.1280025999999999
-----
Variant: "objective"
-----
fun: 37044840.62293503
jac: array([29253964., 29253786.])
message: 'Optimization terminated successfully'
nfev: 16
nit: 2
njev: 2
status: 0
success: True
x: array([0.12245548, 0.87754452])
secs: 49.2468216
-----
Variant: "objective_perf"
-----
fun: 37044840.62293503
jac: array([29253964., 29253786.])
message: 'Optimization terminated successfully'
nfev: 16
nit: 2
njev: 2
status: 0
success: True
x: array([0.12245548, 0.87754452])
secs: 49.036501799999996
-----
Variant: "objective_high_perf"
-----
fun: 37044840.622934975
jac: array([29253784. , 29253777.5])
message: 'Optimization terminated successfully'
nfev: 15
nit: 2
njev: 2
status: 0
success: True
x: array([0.12245547, 0.87754453])
secs: 0.010043600000003039
预期
我猜你的 10 分钟 运行 现在应该小于 10 秒。
在我的示例中,~50 秒 已减少到~1.13 + ~0.01 = ~1.14 秒
我需要最小化两个大(10000 x 40000)矩阵之间的平方和:Σ(X-A)^2 其中
X 是两个矩阵(10000 x 20000)的串联,每个段都单独加权(W),内部函数参见图片。
还有一个约束,权重之和必须等于1 (W1 + W2 = 1)。我目前正在 scipy 优化中使用 SLSQP 方法来获得最佳权重值,但在我的计算机上处理大约需要 10 分钟。有没有更好的方法可以做到这一点而不需要这么长时间?我还在下面附上了我正在使用的代码。
from scipy.optimize import minimize
import numpy
def objective(W,X1,X2,A):
W1=W[0]
W2=W[1]
X=numpy.hstack((W1*X1,W2*X2))
return numpy.sum((X-A)**2)
def constraint1(W):
W1=W[0]
W2=W[1]
return W1+W2-1
x0=[[1,0]]
cons = {'type': 'eq', 'fun':constraint1}
#Random data only used for purposes of example
segment_1 = numpy.random.rand(10000, 20000)
segment_2 = numpy.random.rand(10000, 20000)
A = numpy.random.rand(10000, 40000)
sol=minimize(objective,x0[0],args=(segment_1,segment_2,A),method='SLSQP',constraints=cons)
简介
好吧...有大量潜在改进,我只是触及阻力最小的路径(由于懒惰,因为信息稀疏)。
如果您非常关心性能,请考虑如何快速计算梯度(符号/代数;没有隐藏的数值微分,就像在基于 SLSQP 的代码中所做的那样)。
提示: 比较 SLSQP 的 nfev
和 nit
以查看数值微分的开销!
方法
- 您只对优化两个变量的凸组合感兴趣
- 这很容易映射到标量优化
- 优化
W1
(有界)并且隐含W2
!
代码(未优化的 hack)
原装配件
from scipy.optimize import minimize
import numpy
numpy.random.seed(0)
from time import perf_counter as pc
"""
ORIGNAL CODE
"""
def objective(W,X1,X2,A):
W1=W[0]
W2=W[1]
X=numpy.hstack((W1*X1,W2*X2))
return numpy.sum((X-A)**2)
def constraint1(W):
W1=W[0]
W2=W[1]
return W1+W2-1
x0=[[1,0]]
cons = {'type': 'eq', 'fun':constraint1}
#Random data only used for purposes of example
segment_1 = numpy.random.rand(1000, 20000)
segment_2 = numpy.random.rand(1000, 20000)
A = numpy.random.rand(1000, 40000)
# MODIFICATION: make instance more interesting! != ~0.5/0.5 solution
segment_1 *= 1.3
start_time = pc()
sol = minimize(objective,x0[0],args=(segment_1,segment_2,A),method='SLSQP',constraints=cons)
end_time = pc()
print(sol)
print('secs: {}'.format(end_time - start_time))
修改部分(同上source/file)
"""
ALTERNATIVE CODE hacked around ORIGINAL CODE
"""
from scipy.optimize import minimize_scalar
def solve_alternative(ojective_func, segment_1, segment_2, A):
objective_func_wrapper = lambda x: ojective_func(
(x, 1.0-x), segment_1, segment_2, A)
x = minimize_scalar(objective_func_wrapper, method='Bounded', bounds=(0, 1))
return x
start_time = pc()
sol = solve_alternative(objective, segment_1, segment_2, A)
end_time = pc()
print(sol)
print('secs: {}'.format(end_time - start_time))
示例输出(在类似平板电脑的设备上尺寸缩小到 运行)
fun: 6280656.612055224
jac: array([-2736633.5 , -2736633.375])
message: 'Optimization terminated successfully.'
nfev: 19
nit: 3
njev: 3
status: 0
success: True
x: array([0.45541614, 0.54458386])
secs: 32.6327816
fun: 6280656.612055217
message: 'Solution found.'
nfev: 6
status: 0
success: True
x: 0.45541616584551015
secs: 9.930487200000002
虽然我利用了另一个答案中只有两个变量,但这里我们专注于进行有效的函数评估!
利用 objective 固有的简单性允许重用原始的基于 SLSQP 的优化,同时准备好额外的 段/变量 将来(如评论中所示)只要结构保持不变。
优化成本应该大约等于单个函数评估的成本!
想法
- 由于潜在的内存分配和数组副本,原始函数尽可能低效(例如
np.stack()
)- 可以通过没有内存操作的存储顺序并行评估来改进(参见代码)
- 由于涉及大量数据,通常评估成本很高(在某些优化器的每次迭代中)
- 可以重新制定!
- 尽可能多地进行先验计算,以加快计算速度,具体取决于 仅限优化变量!
- 可以重新制定!
改写基本上是遵循 WolframAlpha:
备注:
- 任务以矩阵形式呈现,但它实际上只是一个扁平化/逐元素计算
- 这在向 WolframAlpha 询问 ideas 之前被直接利用:见输入!
- 比较:
W1 = x
W2 = y
X1 = v_i
(但 1d)X2 = w_i
(但 1d)A = a_i
和b_i
(分解+1d)
代码
import numpy as np
from scipy.optimize import minimize
from time import perf_counter as pc
np.random.seed(0)
# random fake-data
# ################
segment_1 = np.random.rand(5000, 10000) * 7.13
segment_2 = np.random.rand(5000, 10000) * np.random.normal(size=(5000, 10000))
A = np.random.rand(5000, 20000)
# constraint: probability-simplex
# ###############################
def constraint(x):
return np.sum(x) - 1.
# objective
# #########
# original -> very inefficient due to lots of potential memcopy
# -------------------------------------------------------------
def objective(x):
W1=x[0]
W2=x[1]
X=np.hstack((W1*segment_1, W2*segment_2))
return np.sum((X-A)**2)
# modified -> (hopefully) no memory-allocation at all; (hopefully) storage-order parallel iteration
# -------------------------------------------------------------------------------------------------
def objective_perf(x):
return np.sum( ((x[0] * segment_1) - A[:, :segment_1.shape[1]])**2 ) \
+ np.sum( ((x[1] * segment_2) - A[:, segment_1.shape[1]:])**2 )
# heavily reformulated
# ####################
start_time = pc()
# pre-calc: flatten out matrices as we are doing element-wise reasoning anyway
flat_A_segment_A = A[:, :segment_1.shape[1]].ravel()
flat_A_segment_B = A[:, segment_1.shape[1]:].ravel()
flat_segment_A = segment_1.ravel()
flat_segment_B = segment_2.ravel()
# pre-calc: sub-expressions (see WolframAlpha image!) / sum_squares(vec) = np.dot(vec, vec)
comp_0 = np.dot(flat_A_segment_A, flat_A_segment_A) + np.dot(flat_A_segment_B, flat_A_segment_B)
comp_1 = -2 * np.dot(flat_A_segment_A, flat_segment_A)
comp_2 = -2 * np.dot(flat_A_segment_B, flat_segment_B)
comp_3 = np.dot(flat_segment_A, flat_segment_A)
comp_4 = np.dot(flat_segment_B, flat_segment_B)
end_time = pc()
print('pre-calc secs: {}\n'.format(end_time - start_time))
# pre-calc based function-eval / basically *for free*
def objective_high_perf(x):
return comp_0 + x[0] * comp_1 + x[1] * comp_2 + x[0]**2 * comp_3 + x[1]**2 * comp_4
# SLSQP-based solving
# -------------------
cons = {'type': 'eq', 'fun': constraint}
x0 = [1.0, 0.0]
print('-----\nVariant: "objective"\n-----')
start_time = pc()
sol = minimize(objective_perf, x0, method='SLSQP', constraints=cons)
end_time = pc()
print(sol)
print('secs: {}\n'.format(end_time - start_time))
print('-----\nVariant: "objective_perf"\n-----')
start_time = pc()
sol = minimize(objective_perf, x0, method='SLSQP', constraints=cons)
end_time = pc()
print(sol)
print('secs: {}\n'.format(end_time - start_time))
print('-----\nVariant: "objective_high_perf"\n-----')
start_time = pc()
sol = minimize(objective_high_perf, x0, method='SLSQP', constraints=cons)
end_time = pc()
print(sol)
print('secs: {}\n'.format(end_time - start_time))
输出(由于类似平板设备的内存,不是完整数据)
pre-calc secs: 1.1280025999999999
-----
Variant: "objective"
-----
fun: 37044840.62293503
jac: array([29253964., 29253786.])
message: 'Optimization terminated successfully'
nfev: 16
nit: 2
njev: 2
status: 0
success: True
x: array([0.12245548, 0.87754452])
secs: 49.2468216
-----
Variant: "objective_perf"
-----
fun: 37044840.62293503
jac: array([29253964., 29253786.])
message: 'Optimization terminated successfully'
nfev: 16
nit: 2
njev: 2
status: 0
success: True
x: array([0.12245548, 0.87754452])
secs: 49.036501799999996
-----
Variant: "objective_high_perf"
-----
fun: 37044840.622934975
jac: array([29253784. , 29253777.5])
message: 'Optimization terminated successfully'
nfev: 15
nit: 2
njev: 2
status: 0
success: True
x: array([0.12245547, 0.87754453])
secs: 0.010043600000003039
预期
我猜你的 10 分钟 运行 现在应该小于 10 秒。
在我的示例中,~50 秒 已减少到~1.13 + ~0.01 = ~1.14 秒