XMLParser in Pharo Claims U+00A0 is "Invalid UTF-8"
Given the input:
<?xml version='1.0' encoding='UTF-8' standalone='yes' ?>
<sms body=". what" />
where the character after the "." in the body attribute of the sms tag is U+00A0,
I get the error:
XMLEncodingException: Invalid UTF-8 character encoding (line 2) (column 13)
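IIUC, the UTF-8 representation of that character is 0xC2 0xA0,
per Wikipedia. Sure enough, bytes 72 and 73 of the input are 194 and 160, respectively.
This looks like a bug in XMLParser, or am I missing something?
Thanks to monty for coming to the rescue on the Pharo Users list: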
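As a quick sanity check of the byte values quoted above, here is a small Python sketch (Python is used only because the check is language-independent; it is not Pharo code). The `fragment` string is a hypothetical stand-in for the second line of the input, not the actual file, so the offset it finds is relative to that fragment only.

```python
# U+00A0 (NO-BREAK SPACE) encoded as UTF-8.
nbsp = "\u00a0"
encoded = nbsp.encode("utf-8")

print(encoded.hex(" "))   # c2 a0
print(list(encoded))      # [194, 160]

# Hypothetical stand-in for line 2 of the input (not the real file).
fragment = '<sms body=".\u00a0what" />'.encode("utf-8")

# 0-based offset of the two-byte sequence within this fragment.
offset = fragment.find(b"\xc2\xa0")
print(offset, fragment[offset], fragment[offset + 1])
```

So the bytes on disk are a valid UTF-8 encoding of U+00A0, which suggests the problem lies in how the input reaches the parser rather than in the file itself.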
You're double decoding. Use onFileNamed:/parseFileNamed: instead (and
the DOM printToFileNamed: family of messages when writing) and let
XMLParser take care of this for you, or disable XMLParser decoding before
parsing with #decodesCharacters:.
Longer explanation:
The class #on:/#parse: take either a string or a stream (read the
definitions). You gave it a FileReference, but because the argument is
tested with isString and sent #readStream otherwise, it didn't blow up
then.
File refs sent #readStream return file streams that do automatic
decoding. But XMLParser automatically attempts its own decoding too,
if:
- The input starts with a BOM or it can be inferred by null bytes
  before or after the first non-null byte.
- There is an encoding declaration with a non-UTF-8 encoding.
- There is a UTF-8 encoding declaration but the stream is not a normal
  ReadStream (your case).
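One way to read the three conditions above is as a small decision function. The Python sketch below is only a paraphrase of the description in this message, not XMLParser's actual code; the function name and all parameter names are hypothetical.

```python
from typing import Optional


def parser_will_decode(starts_with_bom: bool,
                       encoding_inferred_from_nulls: bool,
                       declared_encoding: Optional[str],
                       is_plain_read_stream: bool) -> bool:
    """Hypothetical paraphrase of the decoding heuristic described above."""
    # Condition 1: a BOM, or an encoding inferred from null bytes.
    if starts_with_bom or encoding_inferred_from_nulls:
        return True
    # No encoding declaration: none of the remaining conditions apply.
    if declared_encoding is None:
        return False
    is_utf8 = declared_encoding.strip().lower() in ("utf-8", "utf8")
    # Condition 2: a declaration naming a non-UTF-8 encoding.
    if not is_utf8:
        return True
    # Condition 3: UTF-8 declared, but the stream is not a plain ReadStream.
    return not is_plain_read_stream


# The case in the question: UTF-8 declared, input is a file stream.
print(parser_will_decode(False, False, "UTF-8", False))  # True
```

Under this reading, the same document passed as a plain in-memory stream would not trigger the parser's own decoding, while a file stream would.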
So it gets decoded twice, and the decoded value of the char causes the
error. I'll consider changing the heuristic to make it less eager to
decode.
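To see why decoding twice fails on this particular character, here is a minimal Python sketch of the mechanism. It assumes a simplified model in which the second decoder treats each already-decoded character as if its code point were a raw byte; that model is an assumption for illustration, not a description of XMLParser internals. The `raw` bytes are a hypothetical stand-in for line 2 of the input.

```python
# Hypothetical stand-in for line 2 of the input, as stored on disk.
raw = '<sms body=".\u00a0what" />'.encode("utf-8")

# First decode: what a decoding file stream would hand to the parser.
once = raw.decode("utf-8")

# Assumed model of the second decode: every decoded character is
# treated as a byte with the same value (U+00A0 becomes byte 0xA0).
as_bytes_again = once.encode("latin-1")

try:
    as_bytes_again.decode("utf-8")
    outcome = "decoded"
except UnicodeDecodeError as err:
    outcome = f"invalid UTF-8 at byte offset {err.start}"

print(outcome)  # invalid UTF-8 at byte offset 12
```

A lone 0xA0 is a UTF-8 continuation byte with no lead byte before it, so the second decode rejects it. In this sketch the failure lands at 0-based offset 12, which is the 13th character of the line and lines up with the column reported in the error.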