Possible SciPy sparse array memory leak in Python
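Edit 3: TL;DR My problem was that my matrix was not sparse enough, and I was also computing the size of the sparse array incorrectly.
I'm hoping someone can explain why this happens. I am using Colab with 51 GB of RAM, and I need to load float32 data from an H5 file. I can load a test H5 file as a numpy array, using ~45 GB of RAM; I load it in batches (21 in total) and stack them. When I instead load each batch into numpy, convert it to sparse, and hstack the results, memory explodes and I get an OOM after about batch 12.
The code below simulates it; you can change the data size to test it on your machine. I get a seemingly unexplainable increase in memory, even though the variables look small when I check their sizes in memory. What is happening? What am I doing wrong?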
import os, psutil
import gc
gc.enable()
from scipy import sparse
import numpy as np
all_x = None
x = (1*(np.random.rand(97406, 2048)>0.39721115241072164)).astype('float32')
x2 = sparse.csr_matrix(x)
print('GB on Memory SPARSE ', x2.data.nbytes/ 10**9)
print('GB on Memory NUMPY ', x.nbytes/ 10**9)
print('sparse to dense mat ratio', x2.data.nbytes/ x.nbytes)
print('_____________________')
for k in range(8):
    if all_x is None:
        all_x = x2
    else:
        all_x = sparse.hstack([all_x, x2])
    print('GB on Memory ALL SPARSE ', all_x.data.nbytes/ 10**9)
    print('GB USED BEFORE GC', psutil.Process(os.getpid()).memory_info().rss/ 10**9)
    gc.collect()
    print('GB USED AFTER GC', psutil.Process(os.getpid()).memory_info().rss/ 10**9)
    print('_____________________')
GB on Memory SPARSE 0.481035332
GB on Memory NUMPY 0.797949952
sparse to dense mat ratio 0.6028389760464576
_____________________
GB on Memory ALL SPARSE 0.481035332
GB USED BEFORE GC 4.62065664
GB USED AFTER GC 4.6206976
_____________________
GB on Memory ALL SPARSE 0.962070664
GB USED BEFORE GC 8.473133056
GB USED AFTER GC 8.473133056
_____________________
GB on Memory ALL SPARSE 1.443105996
GB USED BEFORE GC 12.325183488
GB USED AFTER GC 12.325183488
_____________________
GB on Memory ALL SPARSE 1.924141328
GB USED BEFORE GC 17.140740096
GB USED AFTER GC 17.140740096
_____________________
GB on Memory ALL SPARSE 2.40517666
GB USED BEFORE GC 20.512710656
GB USED AFTER GC 20.512710656
_____________________
GB on Memory ALL SPARSE 2.886211992
GB USED BEFORE GC 22.920142848
GB USED AFTER GC 22.920142848
_____________________
GB on Memory ALL SPARSE 3.367247324
GB USED BEFORE GC 29.660889088
GB USED AFTER GC 29.660889088
_____________________
GB on Memory ALL SPARSE 3.848282656
GB USED BEFORE GC 33.99727104
GB USED AFTER GC 33.99727104
_____________________
Edit: I stacked a list with numpy hstack and it works fine
import os, psutil
import gc
gc.enable()
from scipy import sparse
import numpy as np
all_x = None
x = (1*(np.random.rand(97406, 2048)>0.39721115241072164)).astype('float32')
x2 = sparse.csr_matrix(x)
print('GB on Memory SPARSE ', x2.data.nbytes/ 10**9)
print('GB on Memory NUMPY ', x.nbytes/ 10**9)
print('sparse to dense mat ratio', x2.data.nbytes/ x.nbytes)
print('_____________________')
all_x = np.hstack([x]*21)
print('GB on Memory ALL SPARSE ', all_x.data.nbytes/ 10**9)
print('GB USED BEFORE GC', psutil.Process(os.getpid()).memory_info().rss/ 10**9)
gc.collect()
print('GB USED AFTER GC', psutil.Process(os.getpid()).memory_info().rss/ 10**9)
print('_____________________')
Output
GB on Memory SPARSE 0.480956104
GB on Memory NUMPY 0.797949952
sparse to dense mat ratio 0.6027396866113227
_____________________
GB on Memory ALL SPARSE 16.756948992
GB USED BEFORE GC 38.169387008
GB USED AFTER GC 38.169411584
_____________________
But when I do the same with the sparse matrices, I get an OOM. Going by the byte counts, the sparse matrix should be smaller.
import os, psutil
import gc
gc.enable()
from scipy import sparse
import numpy as np
all_x = None
x = (1*(np.random.rand(97406, 2048)>0.39721115241072164)).astype('float32')
x2 = sparse.csr_matrix(x)
print('GB on Memory SPARSE ', x2.data.nbytes/ 10**9)
print('GB on Memory NUMPY ', x.nbytes/ 10**9)
print('sparse to dense mat ratio', x2.data.nbytes/ x.nbytes)
print('_____________________')
all_x = sparse.hstack([x2]*21)
print('GB on Memory ALL SPARSE ', all_x.data.nbytes/ 10**9)
print('GB USED BEFORE GC', psutil.Process(os.getpid()).memory_info().rss/ 10**9)
gc.collect()
print('GB USED AFTER GC', psutil.Process(os.getpid()).memory_info().rss/ 10**9)
print('_____________________')
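But running the above returns an OOM error.
Edit 2
It looks like I was computing the true size of the sparse matrix incorrectly. It can be computed with
def bytes_in_sparse(a):
    return a.data.nbytes + a.indptr.nbytes + a.indices.nbytes
The true comparison between the dense and the sparse array is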
GB on Memory SPARSE 0.962395268
GB on Memory NUMPY 0.797949952
sparse to dense mat ratio 1.2060847495357703
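As a rough illustration of the size accounting above, here is a minimal sketch that counts every array a sparse matrix holds, for both the compressed (CSR/CSC) and COO layouts. The helper name `sparse_nbytes` and the small test shape are assumptions made for this example only; it also ignores Python object overhead, so treat the result as an approximation.

```python
import numpy as np
from scipy import sparse

def sparse_nbytes(a):
    """Approximate bytes held by the value and index arrays of a sparse matrix."""
    if a.format in ("csr", "csc"):
        # values + column (or row) indices + row (or column) pointers
        return a.data.nbytes + a.indices.nbytes + a.indptr.nbytes
    if a.format == "coo":
        # values + one row index and one column index per stored element
        return a.data.nbytes + a.row.nbytes + a.col.nbytes
    raise ValueError(f"unsupported format: {a.format}")

rng = np.random.default_rng(0)
# Small stand-in for one batch, with roughly 60% nonzeros as in the question.
x = (rng.random((974, 204)) > 0.4).astype("float32")
m_csr = sparse.csr_matrix(x)
m_coo = m_csr.tocoo()

print("dense bytes:", x.nbytes)
print("CSR bytes:  ", sparse_nbytes(m_csr))
print("COO bytes:  ", sparse_nbytes(m_coo))
```

Looking only at `data.nbytes` misses the index arrays, which at this density are about as large as the values themselves.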
Once I use sparse.hstack, the two variables become different types of sparse matrix.
all_x, x2
outputs
(<97406x4096 sparse matrix of type '<class 'numpy.float32'>'
with 240476696 stored elements in COOrdinate format>,
<97406x2048 sparse matrix of type '<class 'numpy.float32'>'
with 120238348 stored elements in Compressed Sparse Row format>)
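Since the stacked result above came back in COO format, one option worth trying is to collect the batches in a list, stack them once, and ask for the output format explicitly through the `format` argument of `sparse.hstack`. The sketch below uses tiny made-up batches (the shapes and the 0.9 threshold are assumptions for illustration); the default output format can differ between SciPy versions, so only the explicit request is relied on here. This does not make a dense-ish matrix cheaper to store, it only avoids repeatedly re-stacking a growing result.

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)
# Five small, genuinely sparse batches (about 10% nonzeros each).
batches = [
    sparse.csr_matrix((rng.random((100, 20)) > 0.9).astype("float32"))
    for _ in range(5)
]

# Stack once at the end instead of growing the result inside a loop,
# and request CSR output instead of relying on the default format.
stacked = sparse.hstack(batches, format="csr")

print(stacked.format, stacked.shape, stacked.nnz)
```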
Using smaller sizes so I don't hang my machine:
In [50]: x = (1 * (np.random.rand(974, 204) > 0.39721115241072164)).astype("float32")
In [51]: x.nbytes
Out[51]: 794784
CSR and its approximate memory use:
In [52]: M = sparse.csr_matrix(x)
In [53]: M.data.nbytes + M.indices.nbytes + M.indptr.nbytes
Out[53]: 960308
hstack actually uses the coo format:
In [54]: Mo = M.tocoo()
In [55]: Mo.data.nbytes + Mo.row.nbytes + Mo.col.nbytes
Out[55]: 1434612
Joining 10 copies increases nbytes by a factor of 10:
In [56]: xx = np.hstack([x]*10)
In [57]: xx.shape
Out[57]: (974, 2040)
The same with sparse:
In [58]: MM = sparse.hstack([M] * 10)
In [59]: MM.shape
Out[59]: (974, 2040)
In [60]: xx.nbytes
Out[60]: 7947840
In [61]: MM
Out[61]:
<974x2040 sparse matrix of type '<class 'numpy.float32'>'
with 1195510 stored elements in Compressed Sparse Row format>
In [62]: M
Out[62]:
<974x204 sparse matrix of type '<class 'numpy.float32'>'
with 119551 stored elements in Compressed Sparse Row format>
In [63]: MM.data.nbytes + MM.indices.nbytes + MM.indptr.nbytes
Out[63]: 9567980
Sparse density:
In [65]: M.nnz / np.prod(M.shape)
Out[65]: 0.6016779401699078
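At this density, sparse storage does not save memory. A density of 0.1 or smaller is a good working density if you want to save both memory and computation time (especially for matrix multiplication).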
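The arithmetic behind that rule of thumb can be sketched directly. Assuming float32 values and 32-bit indices (as in the small examples here), CSR spends about 8 bytes per stored element plus a row-pointer array, while a dense array spends 4 bytes per element, so CSR only wins below a density of roughly 0.5. The helper names `csr_bytes` and `dense_bytes` are hypothetical, and the estimate ignores object overhead; SciPy may also switch to 64-bit indices for very large matrices, which would raise the cost per element.

```python
def csr_bytes(rows, cols, density, value_bytes=4, index_bytes=4):
    """Estimated CSR size: data + indices per nonzero, plus the indptr array."""
    nnz = int(rows * cols * density)
    return nnz * (value_bytes + index_bytes) + (rows + 1) * index_bytes

def dense_bytes(rows, cols, value_bytes=4):
    """Size of a dense array with the same shape and value type."""
    return rows * cols * value_bytes

rows, cols = 97406, 2048  # shape of one batch in the question
for density in (0.6, 0.5, 0.1):
    ratio = csr_bytes(rows, cols, density) / dense_bytes(rows, cols)
    print(f"density {density:.1f}: CSR/dense size ratio = {ratio:.2f}")
```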
In [66]: (x@x.T).shape
Out[66]: (974, 974)
In [67]: %timeit (x@x.T).shape
10.1 ms ± 31.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [68]: (M@M.T).shape
Out[68]: (974, 974)
In [69]: %timeit (M@M.T).shape
220 ms ± 91.8 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)