Why is my byte array returned from fread smaller than the number of bytes in the file itself when it includes unicode special symbols?
I've run into some trouble encoding special characters and parsing XML in a large, complex MATLAB project. I have isolated the problem, and although I still don't know how to solve it, I believe it comes down to the behaviour shown below.
Consider the following XML file:
<?xml version="1.0"?>
<info>
<desc>
<channels>
<channel>
<label>accZ</label>
<unit>mg₀</unit>
<type>ACC</type>
</channel>
</channels>
</desc>
</info>
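Assuming Unix-style line endings, this file contains 175 bytes of data. Indeed, that is what Notepad++ reports when I open it. I also have an XML parsing function, which I copied almost exactly from the MathWorks explanation of how to parse XML in MATLAB (https://au.mathworks.com/help/matlab/ref/xmlread.html). This function works fine and is not the problem; it is included only for completeness: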
% parse a simplified (attribute-free) subset of XML into a MATLAB struct
function result = parse_xml_struct(str)
    import org.xml.sax.InputSource
    import javax.xml.parsers.*
    import java.io.*
    tmp = InputSource();
    tmp.setCharacterStream(StringReader(str));
    result = parseChildNodes(xmlread(tmp));

    % this is part of xml2struct (slightly simplified)
    function [children,ptext] = parseChildNodes(theNode)
        % Recurse over node children.
        children = struct;
        ptext = [];
        if theNode.hasChildNodes
            childNodes = theNode.getChildNodes;
            numChildNodes = childNodes.getLength;
            for count = 1:numChildNodes
                theChild = childNodes.item(count-1);
                [text,name,childs] = getNodeData(theChild);
                if (~strcmp(name,'#text') && ~strcmp(name,'#comment'))
                    if (isfield(children,name))
                        % repeated element name: collect the entries in a cell array
                        if (~iscell(children.(name)))
                            children.(name) = {children.(name)};
                        end
                        index = length(children.(name))+1;
                        children.(name){index} = childs;
                        if (~isempty(text))
                            children.(name){index} = text;
                        end
                    else
                        children.(name) = childs;
                        if (~isempty(text))
                            children.(name) = text;
                        end
                    end
                elseif (strcmp(name,'#text'))
                    % keep text nodes only if they are not pure whitespace
                    if (~isempty(regexprep(text,'[\s]*','')))
                        if (isempty(ptext))
                            ptext = text;
                        else
                            ptext = [ptext text];
                        end
                    end
                end
            end
        end
    end

    % this is part of xml2struct (slightly simplified)
    function [text,name,childs] = getNodeData(theNode)
        % Create structure of node info.
        name = char(theNode.getNodeName);
        if ~isvarname(name)
            name = regexprep(name,'[-]','_dash_');
            name = regexprep(name,'[:]','_colon_');
            name = regexprep(name,'[.]','_dot_');
        end
        [childs,text] = parseChildNodes(theNode);
        if (isempty(fieldnames(childs)))
            try
                text = char(theNode.getData);
            catch
            end
        end
    end
end
Now for the test:
finfo = dir('xml_example');
sz = finfo.bytes
fid = fopen('xml_example', 'r', 'ieee-le.l64');
data = fread(fid, sz, '*char');
data_size = size(data)
h = parse_xml_struct(data);
unit = h.info.desc.channels.channel.unit
And the output:
sz =
175
data_size =
173 1
unit =
'mg₀'
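So I somehow end up with the correct output, but 2 bytes go missing along the way. I don't understand why.
And just to prove to myself that the subscript zero ('₀') is what causes the difference between the file size and the number of bytes in my data array, I removed it from the XML file and got the following: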
sz =
172
data_size =
172 1
unit =
'mg'
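The output for the XML tag is still correct, and now the file size and the byte array size match. What is going on?
Update
Furthermore, if I run the same test with a symbol that is 2 bytes long, I still see the same shrinkage.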
<?xml version="1.0"?>
<info>
<desc>
<channels>
<channel>
<label>ωX</label>
<unit>mrad/s</unit>
<type>AUX</type>
</channel>
</channels>
</desc>
</info>
Output:
sz =
175
data_size =
174 1
unit =
'ωX'
The ₀ symbol is three bytes long when encoded as UTF-8 (0xE2 0x82 0x80). Inside MATLAB, because of the UTF-16 encoding (0x80 0x20, little-endian), it is actually two bytes.
However, because a precision of *char was given to fread, the data is returned as a char array¹. And to a char array, regardless of the underlying encoding, ₀ is simply a single character when considering its size:
If A is a character vector of type char, then size returns the row vector [1 M] where M is the number of characters.
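The byte counts above can be checked directly in MATLAB. The following is a minimal sketch, not part of the original answer: it uses unicode2native to encode single characters, and it assumes that 'UTF-16LE' is an encoding name your MATLAB installation accepts. The variable names are made up for illustration.

```matlab
% Build the two characters from the question by code point.
sub0  = char(8320);   % U+2080 SUBSCRIPT ZERO
omega = char(969);    % U+03C9 GREEK SMALL LETTER OMEGA

% Encode each character; unicode2native returns a uint8 vector of bytes.
utf8_sub0  = unicode2native(sub0,  'UTF-8');    % E2 82 80
utf8_omega = unicode2native(omega, 'UTF-8');    % CF 89
utf16_sub0 = unicode2native(sub0,  'UTF-16LE'); % assumption: this encoding name is accepted

% One character each, but a different number of bytes per encoding.
fprintf('chars: %d, UTF-8 bytes: %s\n', numel(sub0),  sprintf('%02X ', utf8_sub0));
fprintf('chars: %d, UTF-8 bytes: %s\n', numel(omega), sprintf('%02X ', utf8_omega));
fprintf('UTF-16LE bytes: %s\n', sprintf('%02X ', utf16_sub0));
```

This accounts for both results in the question: replacing three bytes with one character gives 175 - 2 = 173, and replacing two bytes with one character gives 175 - 1 = 174.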
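¹ I'm assuming that the strong assertion (per the documentation) of a matching matrix size from fread when sizeA is given applies only to numeric data since, as can be seen above, the number of bytes and the number of characters are not necessarily one-to-one.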
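If the goal is an array whose length matches the file size, one option is to read the file as raw bytes and decode it explicitly. The sketch below is an illustration under stated assumptions rather than the original poster's code: it writes its own small UTF-8 test file to a temporary location (the file name and the sample text are hypothetical), reads it back with a precision of '*uint8', and then converts the bytes to characters with native2unicode.

```matlab
% Write a small UTF-8 test file so the sketch is self-contained.
fname = [tempname '.xml'];                  % hypothetical temporary file
txt = ['<unit>mg' char(8320) '</unit>'];    % char(8320) is U+2080, subscript zero
fid = fopen(fname, 'w');
fwrite(fid, unicode2native(txt, 'UTF-8'), 'uint8');
fclose(fid);

% Read raw bytes: with '*uint8', each returned element is one byte of the file.
finfo = dir(fname);
fid = fopen(fname, 'r');
raw = fread(fid, finfo.bytes, '*uint8');
fclose(fid);

% Decode explicitly; the character count is smaller than the byte count.
str = native2unicode(raw', 'UTF-8');
fprintf('bytes on disk: %d, bytes read: %d, characters: %d\n', ...
    finfo.bytes, numel(raw), numel(str));

delete(fname);                              % clean up the temporary file
```

Here the byte array has 18 elements and the decoded text has 16 characters, so the two-byte gap from the question shows up only after decoding. The decoded str can then be passed to a parser such as parse_xml_struct in the same way as the char data in the question.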