Time & memory complexity management with multi-dimensional matrices using parallelisation and numpy
I have a very large time series of matrices. I am trying to speed this process up and would like to know the best way to do so. The two options that come to mind are parallelising the process with numba, or applying a function across the matrix, e.g. np.apply_along_axis.
Speed and memory complexity are both important. I have attached some example code that generates these matrices. The real ones are much larger (shapes in excess of (400, 400, 400, 400)). Given the nested loops, I assume the two functions determineShiftsJmax and addPadding dominate the running time.
import numpy as np


def determineShiftsJmax(m, jmaxR, jmaxC):
    # Build an object array whose (x, y) entry holds the pad widths
    # ((before, after) per axis) for the inner matrix at that position.
    layerR = min(m, jmaxR - 1)
    layerC = min(m, jmaxC - 1)
    nodesR = 2 * layerR + 1
    nodesC = 2 * layerC + 1

    u = range(0, nodesR)
    b = range(0, nodesC)

    mat = np.zeros((nodesR, nodesC), dtype=object)
    for x, i in enumerate(u):
        for y, j in enumerate(b):
            up = (j, 2 * layerC - j)
            left = (i, 2 * layerR - i)
            mat[x, y] = (left, up)

    # Once the layer limit is reached, repeat the edge entries.
    if (jmaxC <= jmaxR) and (m >= jmaxC):
        res = np.pad(mat, 1, mode="edge")
    elif (jmaxR <= jmaxC) and (m >= jmaxR):
        res = np.pad(mat, 1, mode="edge")
    else:
        res = mat
    return res


def addPadding(array, shift, shape):
    # Pad every inner matrix with its individual shift and stack the results.
    paddedMatrix = []
    for i in range(shape[0]):
        for j in range(shape[1]):
            padding = np.pad(array[i, j], shift[i, j])
            paddedMatrix.append(padding)
    paddedMatrix = np.array(paddedMatrix)
    return paddedMatrix


shapeE = [(1, 1, 3, 3),
          (3, 3, 5, 5),
          (5, 5, 7, 7),
          (7, 7, 9, 7),
          (9, 7, 11, 7),
          (11, 7, 11, 7),
          (11, 7, 11, 7),
          (11, 7, 11, 7),
          (11, 7, 11, 7),
          (11, 7, 11, 7)]

shapeI = [(1, 1, 3, 3),
          (3, 3, 3, 3),
          (5, 5, 3, 3),
          (7, 7, 3, 3),
          (9, 7, 3, 3),
          (11, 7, 3, 3),
          (11, 7, 3, 3),
          (11, 7, 3, 3),
          (11, 7, 3, 3),
          (11, 7, 3, 3)]

qs = [np.ones(x) for x in shapeI]

jmaxR = 3
jmaxC = 5

# Time step 0
m = 0
shift = determineShiftsJmax(m, jmaxC, jmaxR)
newMatrix = addPadding(qs[m], shift, shapeI[m])

# All time steps
newMatrices = []
for m, shape in enumerate(shapeI):
    shift = determineShiftsJmax(m, jmaxC, jmaxR)
    newMatrix = addPadding(qs[m], shift, shape)
    newMatrix = newMatrix.reshape(shapeE[m])
    newMatrices.append(newMatrix)
It is not only the Python loops that slow your code down, so before reaching for parallelisation you should try to improve the memory layout.
A numpy array is one contiguous block of bytes in memory. So whenever you prepend or append something (e.g. padding), a new, larger array has to be created, the padding values set, and the remaining data copied over. This makes your code very inefficient.
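As a quick illustration of that copy behaviour (a minimal sketch, not part of the original code): np.pad always returns a freshly allocated array, so padding repeatedly means allocating and copying repeatedly.

import numpy as np

a = np.ones((3, 3))
b = np.pad(a, 1)                # allocates a new (5, 5) array and copies `a` into it
print(np.shares_memory(a, b))   # False: nothing is reused, the data was copied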
The key is to allocate the output array only once and then copy the parts of the incoming array to their destination. See the following example:
def create_new_matrix(old_matrix, new_shape, m, jmaxR, jmaxC):
    old_shape = old_matrix.shape
    # allocate the memory once
    new_matrix = np.zeros_like(old_matrix, shape=new_shape)
    # calculate the shift magic
    layerR = min(m, jmaxR - 1)
    layerC = min(m, jmaxC - 1)
    nodesR = 2 * layerR + 1
    nodesC = 2 * layerC + 1
    repeat = (
        (jmaxC <= jmaxR) and (m >= jmaxC) or
        (jmaxR <= jmaxC) and (m >= jmaxR)
    )
    # loop to copy the input to the right location
    for l in range(new_shape[0]):
        for k in range(new_shape[1]):
            jlow = min(max(0, l - int(repeat)), new_shape[2] - old_shape[2])
            jhigh = jlow + old_shape[2]
            ilow = min(max(0, k - int(repeat)), new_shape[3] - old_shape[3])
            ihigh = ilow + old_shape[3]
            new_matrix[l, k, jlow:jhigh, ilow:ihigh] = old_matrix[l, k, :, :]
    return new_matrix
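A minimal usage sketch (my addition, not part of the original answer): build all time steps with the single-allocation function and, as a sanity check, compare against the result of the original addPadding/reshape loop. Note that it passes jmaxC and jmaxR in the same swapped order used by the calls in the question.

# Build every time step with one allocation per step and verify the result.
newMatrices2 = []
for m, _ in enumerate(shapeI):
    padded = create_new_matrix(qs[m], shapeE[m], m, jmaxC, jmaxR)
    newMatrices2.append(padded)
    assert np.array_equal(padded, newMatrices[m])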
However, you mention that you expect shapes of (400, 400, 400, 400), so your output arrays can easily fill many gigabytes of memory; assuming float64 you would need roughly 200 GB. Knowing that most of the values will be zero, actually creating the padded matrices in memory may therefore be a problem, if you really need them. I do not know what you want to do with them afterwards, but if your algorithm needs these matrices for any subsequent computation, you could instead build an index mapping that tells you whether the value you are trying to access is a zero or an actual data value. You could also write a function that returns the inner matrix for a given pair of outer indices (l, k) and call it only when needed.
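A minimal sketch of that on-demand idea (the helper name get_inner_matrix is mine; it simply reuses the offset logic from create_new_matrix above): instead of materialising the full 4-D result, compute one padded inner block only when it is requested.

def get_inner_matrix(old_matrix, new_shape, m, jmaxR, jmaxC, l, k):
    # Return only the padded inner matrix for the outer indices (l, k),
    # without allocating the full 4-D output.
    old_shape = old_matrix.shape
    repeat = (
        (jmaxC <= jmaxR) and (m >= jmaxC) or
        (jmaxR <= jmaxC) and (m >= jmaxR)
    )
    inner = np.zeros(new_shape[2:], dtype=old_matrix.dtype)
    jlow = min(max(0, l - int(repeat)), new_shape[2] - old_shape[2])
    ilow = min(max(0, k - int(repeat)), new_shape[3] - old_shape[3])
    inner[jlow:jlow + old_shape[2], ilow:ilow + old_shape[3]] = old_matrix[l, k]
    return inner

# Example: fetch a single padded block instead of storing the whole array.
m = 3
block = get_inner_matrix(qs[m], shapeE[m], m, jmaxC, jmaxR, l=2, k=4)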