MATLAB matfile increases in size when overwriting cell data
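Because I work with large amounts of data and autosave frequently, I decided to switch my saving method from the standard save() function to partial saving with a matfile object:
https://www.mathworks.com/help/matlab/ref/matfile.html
I made this change because save() overwrites everything even when only a small part of a structure has changed, which slowed the program down considerably. However, I noticed that the time to save grows linearly with every write through matfile, and after some debugging I found that this is because the file size increases each time, even when the data is overwritten with identical data. Here is an example: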
% Save MAT file with string variable and cell variable
stringvar = 'hello'
cellvar = {'world'}
save('test.mat', 'stringvar', 'cellvar', '-v7.3')
m = matfile('test.mat', 'Writable', true);
% Get number of bytes of MAT file
f = dir('test.mat'); f.bytes
% Output: 3928 - initial size
% Overwrite stringvar with same data.
m.stringvar = 'hello';
f = dir('test.mat'); f.bytes
% Output: 3928 - same as before
% Overwrite cellvar with same data.
m.cellvar = {'world'};
f = dir('test.mat'); f.bytes
% Output: 4544 - size increased
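I don't understand why the number of bytes increases when the data is identical. It adds a very noticeable delay that grows with every save, so it defeats the purpose of partial saving. Any idea what is going on here? Any help with this would be much appreciated!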
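This is due to the way that cell arrays and more complex data types are stored (and updated) within 7.3 (HDF5) MAT-files. Since cell arrays contain mixed data types, MATLAB stores the cell array variable in the root (/) HDF5 group as a series of references which point to the /#refs# group. That group contains datasets, each of which holds the data for one cell.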
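Whenever you attempt to overwrite the cell array value, the /#refs# HDF5 group gets appended to with new datasets which represent the cell array element data, and the references in the / group are updated to point to this new data. The old (and now unused) datasets in /#refs# are not removed. This is the designed behavior of HDF5 files, since removing data from a file would require shifting all file contents after the deleted region to "close the gap", and this would incur a (potentially huge) performance penalty**.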
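We can illustrate this by using h5disp to look at the contents of the file that MATLAB is creating. Below I'll use an abbreviated version of the h5disp output so that it is clearer: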
stringvar = 'hello';
cellvar = {'world'};
save('test.mat', 'stringvar', 'cellvar', '-v7.3')
h5disp('test.mat')
% HDF5 test.mat
% Group '/'
%     Dataset 'cellvar'                  <--- YOUR CELL ARRAY
%         Size: 1x1                      <--- HERE IS ITS SIZE
%         Datatype: H5T_REFERENCE        <--- THE ACTUAL DATA LIVES IN /#REFS#
%         Attributes:
%             'MATLAB_class': 'cell'
%     Dataset 'stringvar'                <--- YOUR STRING
%         Size: 1x5                      <--- HAS 5 CHARACTERS
%         Datatype: H5T_STD_U16LE (uint16)
%         Attributes:
%             'MATLAB_class': 'char'
%             'MATLAB_int_decode': 2
%     Group '/#refs#'                    <--- WHERE THE DATA FOR THE CELL ARRAY LIVES
%         Attributes:
%             'H5PATH': '/#refs#'
%         Dataset 'a'
%             Size: 2
%             Datatype: H5T_STD_U64LE (uint64)
%             Attributes:
%                 'MATLAB_empty': 1
%                 'MATLAB_class': 'canonical empty'
%         Dataset 'b'                    <--- THE CELL ARRAY DATA
%             Size: 1x5                  <--- CONTAINS A 5-CHAR STRING
%             Datatype: H5T_STD_U16LE (uint16)
%             Attributes:
%                 'MATLAB_class': 'char'
%                 'MATLAB_int_decode': 2
%                 'H5PATH': '/#refs#/b'
%% Now we want to replace the string with a 6-character string
m = matfile('test.mat', 'Writable', true);
m.stringvar = 'hellos';
h5disp('test.mat')
% HDF5 test.mat
% Group '/'
%     Dataset 'cellvar'                  <--- THIS REMAINS UNCHANGED
%         Size: 1x1
%         Datatype: H5T_REFERENCE
%         Attributes:
%             'MATLAB_class': 'cell'
%     Dataset 'stringvar'
%         Size: 1x6                      <--- JUST INCREASED THE LENGTH OF THIS TO 6
%         Datatype: H5T_STD_U16LE (uint16)
%         Attributes:
%             'MATLAB_class': 'char'
%             'MATLAB_int_decode': 2
%     Group '/#refs#'
%         Attributes:
%             'H5PATH': '/#refs#'
%         Dataset 'a'                    <--- NONE OF THIS HAS CHANGED
%             Size: 2
%             Datatype: H5T_STD_U64LE (uint64)
%             Attributes:
%                 'MATLAB_empty': 1
%                 'MATLAB_class': 'canonical empty'
%         Dataset 'b'
%             Size: 1x5
%             Datatype: H5T_STD_U16LE (uint16)
%             Attributes:
%                 'MATLAB_class': 'char'
%                 'MATLAB_int_decode': 2
%                 'H5PATH': '/#refs#/b'
%% Now change the cell (and replace with a 6-character string)
m.cellvar = {'worlds'};
h5disp('test.mat')
% HDF5 test.mat
% Group '/'
%     Dataset 'cellvar'                  <--- HERE IS YOUR CELL ARRAY AGAIN
%         Size: 1x1
%         Datatype: H5T_REFERENCE        <--- STILL A REFERENCE
%         Attributes:
%             'MATLAB_class': 'cell'
%     Dataset 'stringvar'                <--- STRING VARIABLE UNCHANGED
%         Size: 1x6
%         Datatype: H5T_STD_U16LE (uint16)
%         Attributes:
%             'MATLAB_class': 'char'
%             'MATLAB_int_decode': 2
%     Group '/#refs#'
%         Attributes:
%             'H5PATH': '/#refs#'
%         Dataset 'a'                    <--- THE OLD DATA IS STILL HERE
%             Size: 2
%             Datatype: H5T_STD_U64LE (uint64)
%             Attributes:
%                 'MATLAB_empty': 1
%                 'MATLAB_class': 'canonical empty'
%         Dataset 'b'                    <--- THE OLD DATA IS STILL HERE
%             Size: 1x5
%             Datatype: H5T_STD_U16LE (uint16)
%             Attributes:
%                 'MATLAB_class': 'char'
%                 'MATLAB_int_decode': 2
%                 'H5PATH': '/#refs#/b'
%         Dataset 'c'                    <--- THE NEW DATA IS ALSO HERE
%             Size: 2
%             Datatype: H5T_STD_U64LE (uint64)
%             Attributes:
%                 'MATLAB_empty': 1
%                 'MATLAB_class': 'canonical empty'
%         Dataset 'd'                    <--- THE NEW DATA IS ALSO HERE
%             Size: 1x6                  <--- NOW WITH 6 CHARACTERS
%             Datatype: H5T_STD_U16LE (uint16)
%             Attributes:
%                 'MATLAB_class': 'char'
%                 'MATLAB_int_decode': 2
%                 'H5PATH': '/#refs#/d'
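It is this ever-increasing size of the /#refs# group that is causing your file size to increase. Because /#refs# contains the actual data, all of the data in the cell array elements that you are replacing is duplicated every time you write the variable to the file.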
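To see the growth accumulate over several writes rather than just one, a loop along the following lines can help. This is only a rough sketch: the scratch file name from tempname, the variable names, and the number of writes are my own choices for illustration, and the exact byte counts will depend on your MATLAB and HDF5 versions, so I have not listed any expected output.

```matlab
% Sketch: overwrite a cell variable with identical data several times
% and record the size of the v7.3 MAT-file after each write.
fname = [tempname '.mat'];          % hypothetical scratch file
cellvar = {'world'};
save(fname, 'cellvar', '-v7.3');
m = matfile(fname, 'Writable', true);

nWrites = 5;                        % arbitrary number of overwrites
sizes = zeros(1, nWrites + 1);
info = dir(fname);
sizes(1) = info.bytes;              % size right after save()

for k = 1:nWrites
    m.cellvar = {'world'};          % same data every time
    info = dir(fname);
    sizes(k + 1) = info.bytes;
end

% File size in bytes before and after each overwrite
disp(sizes)

% Count the datasets that have piled up in /#refs#
refsInfo = h5info(fname, '/#refs#');
fprintf('%d datasets in /#refs#\n', numel(refsInfo.Datasets));

clear m
delete(fname);
```

If the explanation above holds on your system, the dataset count should rise with each overwrite while the data MATLAB reads back stays the same.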
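As for why Mathworks chose to use HDF5 for 7.3 MAT-files despite this seemingly large limitation, it seems that the motivation for the 7.3 format was to aid in accessing data within the file rather than to optimize file size.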
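One possible workaround is to use the 7.0 format instead, which is not HDF5 and whose file size does not grow when a cell array variable is modified. The only real downside of 7.0 compared with 7.3 is that you can't modify just part of a variable in the 7.0 files. An added benefit is that for complex data, the 7.0 .mat files are typically faster to read and write than the 7.3 HDF5 files.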
% Helper function to tell us the size
printsize = @(filename)disp(getfield(dir(filename), 'bytes'));
stringvar = 'hello'
cellvar = {'world'}
% Save as 7.0 version
save('test.mat', 'stringvar', 'cellvar', '-v7')
printsize('test.mat')
% 256
m = matfile('test.mat', 'Writable', true);
m.stringvar = 'hello';
printsize('test.mat')
% 256
m.cellvar = {'world'};
printsize('test.mat')
% 256
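If you still want to use the 7.3 files, it may be worth loading the cell array into a temporary variable, modifying it within your function, and writing it back to the file only rarely, to prevent unnecessary writes.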
tmp = m.cellvar;
% Make many modifications
tmp{1} = 'hello';
tmp{2} = 'world';
tmp{1} = 'Just kidding!';
% Write once after all changes have been made
m.cellvar = tmp;
** Normally you could use h5repack to reclaim the unused space in the file; however, MATLAB doesn't actually delete the data within /#refs#, so h5repack has no effect. From what I gather, you'd have to delete the data yourself and then use h5repack to free up the unused space.
fid = H5F.open('test.mat', 'H5F_ACC_RDWR', 'H5P_DEFAULT');
% I've hard-coded these names just as an example
H5L.delete(fid, '/#refs#/a', 'H5P_DEFAULT')
H5L.delete(fid, '/#refs#/b', 'H5P_DEFAULT')
H5F.close(fid);
system('h5repack test.mat test.repacked.mat');
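If deleting the orphaned datasets by hand is too fiddly, another option (not the h5repack route above, just a plain rewrite) is to load the whole file occasionally and save it again from scratch. This is a minimal sketch, assuming the full contents fit in memory; the scratch file name and loop count are made up for illustration. Since save() without -append writes a fresh file, nothing unreferenced should carry over.

```matlab
% Sketch: compact a v7.3 MAT-file by rewriting it from scratch.
fname = [tempname '.mat'];          % hypothetical scratch file
cellvar = {'world'};
save(fname, 'cellvar', '-v7.3');

% Bloat the file with repeated overwrites of the cell variable
m = matfile(fname, 'Writable', true);
for k = 1:20
    m.cellvar = {'world'};
end
clear m

before = dir(fname);

% Load every variable into a struct, then save the struct's fields back
% as individual variables. save() creates a new file here.
contents = load(fname);
save(fname, '-struct', 'contents', '-v7.3');

after = dir(fname);
fprintf('before: %d bytes, after: %d bytes\n', before.bytes, after.bytes);

delete(fname);
```

This reads and writes everything, so it costs as much as a full save() and only makes sense as occasional maintenance, not on every autosave.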