MATLAB fwrite with skip slow

Question

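I am writing some fairly large (~500 MB to 3 GB) binary data files in MATLAB using the fwrite command.

I want the data to be written in a tabular format, so I am using the skip parameter. For example, I have two vectors of uint8 values, a = [1 2 3 4]; b = [5 6 7 8], and I want the binary file to look like this: 1 5 2 6 3 7 4 8

So in my code I do something similar to this (my real data is more complicated):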

fwrite(f,a,'1*uint8',1);
fseek(f,2)
fwrite(f,b,'1*uint8',1);

But the write speed is very slow (2 MB/s).

I ran the following block of code. When I set the skip count to 1, the write is roughly 300 times slower.

>> f = fopen('testfile.bin', 'w');
>> d = uint8(1:500e6);
>> tic; fwrite(f,d,'1*uint8',1); toc
Elapsed time is 58.759686 seconds.
>> tic; fwrite(f,d,'1*uint8',0); toc
Elapsed time is 0.200684 seconds.
>> 58.759686/0.200684

ans =

  292.7971

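I could understand a 2x or 4x slowdown, since with the skip parameter set to 1 you have to traverse twice as many bytes, but 300x makes me think I am doing something wrong.

Has anyone run into this before? Is there a way to speed up the write?

Thanks!

Update

I wrote the following function to format arbitrary data sets. Write speed is vastly improved (~300 MB/s) for large data sets.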

%
%  data: A cell array of matrices. Matrices can be composed of any
%        non-complex numeric data. Each entry in data is considered
%        to be an independent column in the data file. Rows are indexed
%        by the last dimension of the numeric matrix, hence the count of
%        elements in the last dimension of every matrix must match.
%
%   e.g. 
%   size(data{1}) == [1,5]
%   size(data{2}) == [4,5]
%   size(data{3}) == [3,2,5]
%
%   The data variable has 3 columns and 5 rows. Column 1 is made of scalar
%   values. Column 2 is made of vectors of length 4. Column 3 is made of
%   3 x 2 matrices.
%
% 
%  returns buffer: an N x M matrix of bytes where N is the number of bytes
%  in each row of data, and M is the number of rows of data. 

function [buffer] = makeTabularDataBuffer(data)
    dataTypes = {};
    dataTypesLengthBytes = [];
    rowElementCounts = []; %the number of elements in each "row"

    rowCount = [];

    %figure out properties of tabular data
    for idx = 1:length(data)

        cDat = data{idx};
        dimSize = size(cDat);

        %ensure each column has the same number of rows.
        if isempty(rowCount)
            rowCount = dimSize(end);
        else
            if dimSize(end) ~= rowCount
                throw(MException('e:e', sprintf('data column %d does not have the required number of rows (%d)\n',idx,rowCount)));
            end
        end

        dataTypes{idx} = class(data{idx});
        %size in bytes of one element of this type (cast avoids eval)
        dataTypesLengthBytes(idx) = numel(typecast(cast(1, dataTypes{idx}),'uint8'));
        rowElementCounts(idx) = prod(dimSize(1:end-1));

    end

    rowLengthBytes = sum(rowElementCounts .* dataTypesLengthBytes);
    buffer = zeros(rowLengthBytes, rowCount,'uint8'); %rows of the dataset map to column in the buffer matrix because fwrite writes columnwise

    bufferRowStartIdxs = cumsum([1 dataTypesLengthBytes .* rowElementCounts]);

    %load data 1 column at a time into the buffer
    for idx = 1:length(data)
        cDat = data{idx};
        columnWidthBytes = dataTypesLengthBytes(idx)*rowElementCounts(idx);

        cRowIdxs = bufferRowStartIdxs(idx):(bufferRowStartIdxs(idx+1)-1);

        buffer(cRowIdxs,:) = reshape(typecast(cDat(:),'uint8'),columnWidthBytes,[]); 
    end

end

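I have done only very limited testing of the function, but it appears to work as expected. The returned buffer matrix can then be passed to fwrite without the skip argument, and fwrite will write the buffer in column-major order.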

dat = {};
dat{1} = uint16([1 2 3 4]);
dat{2} = uint16([5 6 7 8]);
dat{3} = double([9 10 ; 11 12; 13 14; 15 16])';

buffer = makeTabularDataBuffer(dat)

buffer =

  20×4 uint8 matrix

    1    2    3    4
    0    0    0    0
    5    6    7    8
    0    0    0    0
    0    0    0    0
    0    0    0    0
    0    0    0    0
    0    0    0    0
    0    0    0    0
    0    0    0    0
   34   38   42   46
   64   64   64   64
    0    0    0    0
    0    0    0    0
    0    0    0    0
    0    0    0    0
    0    0    0    0
    0    0    0    0
   36   40   44   48
   64   64   64   64
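
To check that a buffer laid out this way really produces the intended file, one option is to write it with a single fwrite call and read one column back using fread with a skip. The following is only a minimal sketch: it builds a small two-column buffer by hand instead of calling makeTabularDataBuffer so that it runs on its own, and the names colA, colB, colBRead and the temporary file name are made up for illustration.

```matlab
% Two hypothetical columns with four rows each.
colA = uint16([1 2 3 4]);
colB = uint16([5 6 7 8]);

% Build the byte buffer by hand: 4 bytes per data row,
% and one buffer column per data row.
buffer = [reshape(typecast(colA, 'uint8'), 2, []); ...
          reshape(typecast(colB, 'uint8'), 2, [])];

fname = [tempname, '.bin'];    % scratch file, name is arbitrary
f = fopen(fname, 'w');
fwrite(f, buffer, 'uint8');    % one sequential write, column-major order
fclose(f);

% Read column B back: start at byte offset 2, then repeatedly
% read one uint16 and skip the 2 bytes that belong to column A.
f = fopen(fname, 'r');
fseek(f, 2, 'bof');
colBRead = fread(f, [1 Inf], '1*uint16=>uint16', 2);
fclose(f);
delete(fname);

assert(isequal(colBRead, colB));
```

The same byte order is used for writing and reading here, so the check does not depend on the machine's endianness.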

Answer 1

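For best I/O performance, use sequential writes and avoid skipping.

Reorder the data in RAM before saving it to the file.
Reordering data in RAM is on the order of 100 times faster than reordering it on disk.

I/O operations and storage devices are optimized for sequential writes of large chunks of data (optimized in both hardware and software).

On a mechanical drive (HDD), writing data with skips can take a very long time, because the drive's mechanical head must move (the OS usually optimizes this by using a memory buffer, but in principle it takes a long time).

With an SSD there is no mechanical seeking, but sequential writes are still much faster. Read the following post, Sequential vs Random I/O on SSDs?, for some explanation.

Example of reordering the data in RAM: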

a = uint8([1 2 3 4]);
b = uint8([5 6 7 8]);

% Open the output file for writing (the file name is just an example).
f = fopen('testfile.bin', 'w');

% Allocate memory for the reordered elements (use the uint8 type to save RAM).
c = zeros(1, length(a) + length(b), 'uint8');

% Reorder a and b in RAM.
c(1:2:end) = a;
c(2:2:end) = b;

% Write array c to the file.
fwrite(f, c, 'uint8');
fclose(f);
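
When the vectors have the same length and the same type, an equivalent way to interleave them is to stack them as rows of a matrix, since fwrite writes matrices in column-major order. This is only a sketch of that variant; the temporary file name and the variable onDisk are introduced here for illustration.

```matlab
a = uint8([1 2 3 4]);
b = uint8([5 6 7 8]);

% Stack the vectors as rows; column-major order interleaves them.
c = [a; b];

fname = [tempname, '.bin'];    % scratch file, name is arbitrary
f = fopen(fname, 'w');
fwrite(f, c, 'uint8');         % single sequential write, no skip
fclose(f);

% Read the bytes back to inspect the order on disk.
f = fopen(fname, 'r');
onDisk = fread(f, [1 Inf], 'uint8=>uint8');
fclose(f);
delete(fname);

disp(onDisk);
```

The bytes read back should be 1 5 2 6 3 7 4 8, which is the layout asked for in the question.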

Time measurements on my machine:

Writing the file to an SSD:
Elapsed time is 56.363397 seconds.
Elapsed time is 0.280049 seconds.
Writing the file to an HDD:
Elapsed time is 56.063186 seconds.
Elapsed time is 0.522933 seconds.
Reordering d in RAM:
Elapsed time is 0.965358 seconds.
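
A comparison like the one above can be tried at a smaller scale with something along these lines. This is a rough sketch, not the exact script used for the numbers above: the size n, the random test data and the temporary file name are assumptions, and the timings will vary with the machine, the drive and the MATLAB version.

```matlab
n = 1e6;                                   % assumed size, far below the 500e6 in the question
d = randi([0 255], 1, n, 'uint8');         % arbitrary test data
fname = [tempname, '.bin'];                % scratch file, name is arbitrary

% 1) Write with a skip of one byte per value.
f = fopen(fname, 'w');
tic; fwrite(f, d, '1*uint8', 1); tSkip = toc;
fclose(f);

% 2) Write the same values sequentially, without a skip.
f = fopen(fname, 'w');
tic; fwrite(f, d, 'uint8'); tSeq = toc;
fclose(f);

% 3) Spread d over every other byte of a buffer in RAM.
tic;
c = zeros(1, 2*n, 'uint8');
c(2:2:end) = d;
tReorder = toc;

delete(fname);

fprintf('write with skip:   %.4f s\n', tSkip);
fprintf('sequential write:  %.4f s\n', tSeq);
fprintf('reorder in RAM:    %.4f s\n', tReorder);
```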

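Why 300 times slower and not 4 times?
My guess is that the software implementation of writing data with skips is not optimized for best performance.

According to the following post:

fseek() or fflush() require the library to commit buffered operations.

Daniel's guess (in the comments) is probably correct.
"The skip causes MATLAB to flush after each byte."
Skipping is probably implemented using fseek(), and fseek() forces buffered data to be flushed to disk.
That could explain why writing with skips is so painfully slow.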

file-io

matlab

fwrite