Why is the byte array returned from fread smaller than the file's size in bytes when the file includes Unicode special symbols?

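I'm having some trouble encoding special characters and parsing XML in a large, complicated MATLAB project. I've isolated the issue, and although I still don't know how to solve it, I think it is related to the behaviour described below.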

Consider the following XML file:

<?xml version="1.0"?>
<info>
    <desc>
        <channels>
            <channel>
                <label>accZ</label>
                <unit>mg₀</unit>
                <type>ACC</type>
            </channel>
        </channels>
    </desc>
</info>

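Assuming Unix-style line endings, this file contains 175 bytes of data. Indeed, that is what Notepad++ reports when I open it. I also have an XML parsing function that I copied almost verbatim from the MathWorks explanation of how to parse XML in MATLAB (https://au.mathworks.com/help/matlab/ref/xmlread.html). This function works fine and is not the problem; it is included only for completeness: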

% parse a simplified (attribute-free) subset of XML into a MATLAB struct
function result = parse_xml_struct(str)
import org.xml.sax.InputSource
import javax.xml.parsers.*
import java.io.*
tmp = InputSource();
tmp.setCharacterStream(StringReader(str));
result = parseChildNodes(xmlread(tmp));

% this is part of xml2struct (slightly simplified)
    function [children,ptext] = parseChildNodes(theNode)
        % Recurse over node children.
        children = struct;
        ptext = [];
        if theNode.hasChildNodes
            childNodes = theNode.getChildNodes;
            numChildNodes = childNodes.getLength;
            for count = 1:numChildNodes
                theChild = childNodes.item(count-1);
                [text,name,childs] = getNodeData(theChild);
                if (~strcmp(name,'#text') && ~strcmp(name,'#comment'))
                    if (isfield(children,name))
                        if (~iscell(children.(name)))
                            children.(name) = {children.(name)}; end
                        index = length(children.(name))+1;
                        children.(name){index} = childs;
                        if(~isempty(text))
                            children.(name){index} = text; end
                    else
                        children.(name) = childs;
                        if(~isempty(text))
                            children.(name) = text; end
                    end
                elseif (strcmp(name,'#text'))
                    if (~isempty(regexprep(text,'[\s]*','')))
                        if (isempty(ptext))
                            ptext = text;
                        else
                            ptext = [ptext text];
                        end
                    end
                end
            end
        end
    end

% this is part of xml2struct (slightly simplified)
    function [text,name,childs] = getNodeData(theNode)
        % Create structure of node info.
        name = char(theNode.getNodeName);
        if ~isvarname(name)
            name = regexprep(name,'[-]','_dash_');
            name = regexprep(name,'[:]','_colon_');
            name = regexprep(name,'[.]','_dot_');
        end
        [childs,text] = parseChildNodes(theNode);
        if (isempty(fieldnames(childs)))
            try
                text = char(theNode.getData);
            catch
            end
        end
    end
end

Now for the test:

finfo = dir('xml_example');
sz = finfo.bytes
fid = fopen('xml_example', 'r', 'ieee-le.l64');
data = fread(fid, sz, '*char');
data_size = size(data)
h = parse_xml_struct(data);
unit = h.info.desc.channels.channel.unit

And the output:

sz =

   175


data_size =

   173     1


unit =

    'mg₀'

So I somehow end up with the correct output, but lose 2 bytes along the way. I don't understand why this happens.

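Just to prove to myself that it is the small subscript zero ('₀') that causes the discrepancy between the file size and the number of elements in my data array, I removed it from the XML file and got the following: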

sz =

   172


data_size =

   172     1


unit =

    'mg'

Still the correct output for the XML tag, and now the file size and the array size match. What is going on?

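Update

Additionally, if I run the same test with a symbol that is 2 bytes long in UTF-8 (ω), I still see the array shrink, this time by one element.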

<?xml version="1.0"?>
<info>
    <desc>
        <channels>
            <channel>
                <label>ωX</label>
                <unit>mrad/s</unit>
                <type>AUX</type>
            </channel>
        </channels>
    </desc>
</info>

Output:

sz =

   175


data_size =

   174     1


unit =

    'ωX'

The symbol ₀ (U+2080) is three bytes long in UTF-8 (0xE2 0x82 0x80). Inside MATLAB, where characters are held as UTF-16 (0x80 0x20, little-endian), it occupies only two bytes.
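The byte values quoted above can be checked directly. This is a minimal sketch using unicode2native; the variable names are mine, and it assumes the 'UTF-16LE' encoding name is available on your platform:

```matlab
% U+2080 (SUBSCRIPT ZERO), the character used in 'mg₀'
sub0 = char(hex2dec('2080'));

% Bytes as they would be stored in a UTF-8 file
utf8Bytes = unicode2native(sub0, 'UTF-8');

% Bytes of the single UTF-16 code unit, little-endian, no byte order mark
utf16Bytes = unicode2native(sub0, 'UTF-16LE');

disp(dec2hex(double(utf8Bytes)));   % one row per byte: E2, 82, 80
disp(dec2hex(double(utf16Bytes)));  % one row per byte: 80, 20
disp(numel(sub0));                  % 1, because it is a single char element
```

So the same symbol is three bytes on disk, two bytes in memory, and one element as far as size and numel are concerned.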

However, since a precision of *char was given to fread, the data is returned as a char array.[1] And in a char array, each symbol, regardless of the underlying encoding, counts as a single character when considering its size:

If A is a character vector of type char, then size returns the row vector [1 M] where M is the number of characters.
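To tie this back to the numbers in the question, here is a small sketch that compares the character count with the UTF-8 byte count for the three strings involved. The variable names are illustrative and not part of the original code:

```matlab
% 'mg₀', 'ωX' and plain 'mg', built from code points to avoid editor encoding issues
samples = {['mg' char(hex2dec('2080'))], [char(hex2dec('03C9')) 'X'], 'mg'};

nChars = zeros(1, numel(samples));
nBytes = zeros(1, numel(samples));
for k = 1:numel(samples)
    nChars(k) = size(samples{k}, 2);                          % characters, as size reports them
    nBytes(k) = numel(unicode2native(samples{k}, 'UTF-8'));   % bytes in a UTF-8 file
    fprintf('%d characters, %d UTF-8 bytes, difference %d\n', ...
        nChars(k), nBytes(k), nBytes(k) - nChars(k));
end
```

The differences come out as 2, 1 and 0, which matches the gaps in the question: 175 versus 173 with ₀, 175 versus 174 with ω, and 172 versus 172 with neither.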

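[1] I assume that the documentation's strong assertion about fread returning a matrix of matching size when sizeA is given applies only to numeric data since, as seen above, the number of bytes and the number of characters are not necessarily one-to-one.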
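If you need the element count to match the file size, one option is to read raw bytes and decode them yourself. The sketch below is only an illustration under stated assumptions: it writes a small hypothetical temporary file rather than using xml_example, and it assumes the file is UTF-8 encoded.

```matlab
% Hypothetical temporary file standing in for 'xml_example'
fname = [tempname '.xml'];
txt   = ['<unit>mg' char(hex2dec('2080')) '</unit>'];  % 16 characters
bytes = unicode2native(txt, 'UTF-8');                  % 18 bytes

fid = fopen(fname, 'w');
fwrite(fid, bytes, 'uint8');
fclose(fid);

finfo = dir(fname);
sz = finfo.bytes;                                      % size on disk in bytes

% Reading as uint8 returns one element per byte
fid = fopen(fname, 'r');
raw = fread(fid, sz, '*uint8');
fclose(fid);

% Reading as *char decodes the bytes, so fewer elements come back
fid = fopen(fname, 'r', 'n', 'UTF-8');                 % encoding stated explicitly
chars = fread(fid, sz, '*char');
fclose(fid);

% Decoding the raw bytes gives the same text as the *char read
decoded = native2unicode(raw.', 'UTF-8');

fprintf('%d bytes on disk, %d uint8 elements, %d characters\n', ...
    sz, numel(raw), numel(chars));
delete(fname);
```

Passing the encoding to fopen explicitly also avoids relying on the platform default, which may differ between machines and MATLAB versions.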