Python HDF5 Sparse Out of Core Datasets

Question

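How can I store a sparse NDArray on disk in Python?

I am answering my own question because I wasted nearly a week trying to get sparse out-of-core matrices working. Maybe this is obvious to some people, but it wasn't to me and one other poor soul!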

Answer 1

Following a hint in the accepted answer here, and then testing with datasets made by h5py, the following step-by-step test works.

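>>> import h5py
>>> f = h5py.File('./test.h5', 'w')  # explicit write mode; h5py 3.x opens read-only by default
>>> # 10000x10000 dataset stored in 100x100 chunks (h5py's default dtype is a 4-byte float)
>>> d = f.create_dataset('test', (10000, 10000), chunks=(100, 100))
>>> f.flush()
>>> d[1,1] = 1.0          # first write into chunk (0, 0)
>>> f.flush()
>>> d[2,1] = 1.0          # same chunk as before
>>> f.flush()
>>> d[2,100] = 1.0        # neighbouring chunk (0, 1)
>>> f.flush()
>>> d[2000,100] = 1.0     # distant chunk (20, 1)
>>> f.flush()
>>> d[2000,1000] = 1.0    # distant chunk (20, 10)
>>> f.flush()
>>> f.close()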

Here is the file size reported in bash after each flush:

$ ls -lth test.h5
-rw-rw-r-- 1 aidan aidan 1.4K Jul 28 18:51 test.h5
$ ls -lth test.h5
-rw-rw-r-- 1 aidan aidan 43K Jul 28 18:51 test.h5
$ ls -lth test.h5
-rw-rw-r-- 1 aidan aidan 43K Jul 28 18:52 test.h5
$ ls -lth test.h5
-rw-rw-r-- 1 aidan aidan 83K Jul 28 18:52 test.h5
$ ls -lth test.h5
-rw-rw-r-- 1 aidan aidan 122K Jul 28 18:52 test.h5
$ ls -lth test.h5
-rw-rw-r-- 1 aidan aidan 161K Jul 28 18:53 test.h5
$
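
The same growth can also be checked from inside Python instead of with ls. The following is a minimal sketch, assuming an h5py build that exposes the low-level `get_storage_size()` call on `d.id`; the scratch path and the `sizes` list are made up for illustration. It reports the raw-data bytes allocated for the dataset, so the small file-metadata overhead that ls includes does not show up here.

```python
import os
import tempfile

import h5py

# Hypothetical scratch location; any writable path would do.
path = os.path.join(tempfile.mkdtemp(), "test.h5")

with h5py.File(path, "w") as f:
    # Same layout as the session above: 100x100 chunks, default 4-byte float dtype.
    d = f.create_dataset("test", (10000, 10000), chunks=(100, 100))
    sizes = []
    for row, col in [(1, 1), (2, 1), (2, 100), (2000, 100), (2000, 1000)]:
        d[row, col] = 1.0
        f.flush()
        # Bytes allocated for this dataset's chunks (file metadata excluded).
        sizes.append(d.id.get_storage_size())

print(sizes)
```

The list should only grow on the writes that touch a chunk that has not been written before.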

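As can be seen, the file size only grows in increments of about 40 KB (one 100x100 chunk of 4-byte floats), and only when a newly written element falls outside the chunks that already exist. We can also skip around, and only the chunks that are actually needed get created (i.e. not the ones in between)!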
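
The arithmetic behind those numbers can be sketched in plain Python. This is only a back-of-the-envelope model, not something h5py provides: the names `writes`, `touched` and `expected` are illustrative, and it assumes 4-byte floats (h5py's default dtype when none is given) and no compression.

```python
# The five writes from the session above, as (row, col) indices.
writes = [(1, 1), (2, 1), (2, 100), (2000, 100), (2000, 1000)]
chunk_shape = (100, 100)
itemsize = 4  # assumption: 4-byte floats

chunk_bytes = chunk_shape[0] * chunk_shape[1] * itemsize  # 40000 bytes per chunk

touched = set()  # chunk coordinates that have received at least one write
expected = []    # modelled raw-data bytes after each write
for row, col in writes:
    touched.add((row // chunk_shape[0], col // chunk_shape[1]))
    expected.append(len(touched) * chunk_bytes)

print(sorted(touched))  # [(0, 0), (0, 1), (20, 1), (20, 10)]
print(expected)         # [40000, 40000, 80000, 120000, 160000]
```

Only four of the 10,000 possible chunks are touched, and the modelled sizes track the 43K, 43K, 83K, 122K and 161K reported by ls once a few kilobytes of file metadata are added.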

Magic!

python

sparse-matrix

multidimensional-array