XmlDocument with Kanji text content is not encoded correctly to ISO-8859-1 using XmlTextWriter

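I have an XmlDocument that includes Kanji in its text content, and I need to write it to a stream using ISO-8859-1 encoding. When I do, none of the Kanji characters are encoded properly; they are replaced with "??" instead.

Here is sample code that demonstrates how the XML is written from the XmlDocument: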

MemoryStream mStream = new MemoryStream();
Encoding enc = Encoding.GetEncoding("ISO-8859-1");
XmlTextWriter writer = new XmlTextWriter(mStream,enc);
doc.WriteTo(writer);
writer.Flush();
mStream.Flush();
mStream.Position = 0;
StreamReader sReader = new StreamReader(mStream, enc);
String formattedXML = sReader.ReadToEnd();

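How can I get the Kanji to encode correctly in this specific situation?

As noted in comments, the ? character is appearing because Kanji characters are not supported by the encoding ISO-8859-1, so it substitutes ? as a fallback character. Encoding fallbacks are discussed in the Documentation Remarks for Encoding: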

Note that the encoding classes allow errors (unsupported characters) to:

  • Silently change to a "?" character.
  • Use a "best fit" character.
  • Change to an application-specific behavior through use of the EncoderFallback and DecoderFallback classes with the U+FFFD Unicode replacement character.

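This is the behavior you are seeing.

However, even though Kanji characters are not supported by ISO-8859-1, you can get a much better result by switching to the newer XmlWriter returned by XmlWriter.Create(Stream, XmlWriterSettings) and setting your encoding on XmlWriterSettings.Encoding, like so: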
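To make the fallback visible outside of any XML writer, here is a minimal sketch that encodes a string straight to ISO-8859-1. The class and method names (FallbackDemo, RoundTripLatin1, IsLatin1Encodable) are hypothetical and exist only for illustration. The second method assumes you want to detect unsupported characters up front, which it does by requesting the encoding with EncoderFallback.ExceptionFallback rather than the default fallback.

```csharp
using System;
using System.Text;

public static class FallbackDemo
{
    // Encodes text to ISO-8859-1 with the default fallback, then decodes it again.
    // Characters the encoding cannot represent come back as the fallback character.
    public static string RoundTripLatin1(string text)
    {
        Encoding enc = Encoding.GetEncoding("ISO-8859-1");
        return enc.GetString(enc.GetBytes(text));
    }

    // Returns true only if every character in text can be represented in ISO-8859-1.
    // The exception fallback turns the silent substitution into an error we can catch.
    public static bool IsLatin1Encodable(string text)
    {
        Encoding strict = Encoding.GetEncoding(
            "ISO-8859-1",
            EncoderFallback.ExceptionFallback,
            DecoderFallback.ExceptionFallback);
        try
        {
            strict.GetBytes(text);
            return true;
        }
        catch (EncoderFallbackException)
        {
            return false;
        }
    }

    public static void Main()
    {
        Console.WriteLine(RoundTripLatin1("\u7551 hatake"));   // U+7551 is not in ISO-8859-1
        Console.WriteLine(IsLatin1Encodable("\u7551 hatake"));
        Console.WriteLine(IsLatin1Encodable("caf\u00E9"));     // U+00E9 is in ISO-8859-1
    }
}
```

A check like IsLatin1Encodable can be useful as a guard or in a test, but it only reports the problem; it does not preserve the characters.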

MemoryStream mStream = new MemoryStream();

var enc = Encoding.GetEncoding("ISO-8859-1");
var settings = new XmlWriterSettings
{
    Encoding = enc,
    CloseOutput = false,
    // Remove to enable the XML declaration if you want it.  XmlTextWriter doesn't include it automatically.
    OmitXmlDeclaration = true,  
};
using (var writer = XmlWriter.Create(mStream, settings))
{
    doc.WriteTo(writer);
}

mStream.Position = 0;
var sReader = new StreamReader(mStream, enc);
var formattedXML = sReader.ReadToEnd();

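By setting the Encoding property on XmlWriterSettings, the XML writer is made aware whenever a character is not supported by the current encoding, and automatically replaces it with an XML character entity reference rather than some hard-coded fallback.

For example, say your XML looks like the following: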

<Root>
  <string>畑 はたけ hatake "field of crops"</string>
</Root>

Then your code will output the following, mapping every unsupported character to the single fallback character:

<Root><string>? ??? hatake "field of crops"</string></Root>

And the new version will output:

<Root><string>&#x7551; &#x306F;&#x305F;&#x3051; hatake "field of crops"</string></Root>

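Notice that the Kanji characters have been replaced with character entities such as &#x7551;? All compliant XML parsers will recognize and reconstruct those characters, so no information is lost even though your preferred encoding does not support Kanji.

Finally, as a side note, the documentation for XmlTextWriter states: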
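The claim that parsers reconstruct the escaped characters can be checked with a short round trip. This is a sketch rather than the code from the linked fiddle: the names EntityRoundTripDemo and WriteAsLatin1 are hypothetical, and comparing InnerText is just one simple way to confirm that the text content survived.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

public static class EntityRoundTripDemo
{
    // Serializes doc as ISO-8859-1 through XmlWriter and returns the result as a string.
    public static string WriteAsLatin1(XmlDocument doc)
    {
        var enc = Encoding.GetEncoding("ISO-8859-1");
        var settings = new XmlWriterSettings { Encoding = enc, OmitXmlDeclaration = true };
        using (var stream = new MemoryStream())
        {
            // Disposing the writer flushes its buffered output into the stream.
            using (var writer = XmlWriter.Create(stream, settings))
            {
                doc.WriteTo(writer);
            }
            return enc.GetString(stream.ToArray());
        }
    }

    public static void Main()
    {
        var original = new XmlDocument();
        original.LoadXml("<Root><string>\u7551 \u306F\u305F\u3051 hatake</string></Root>");

        // Unsupported characters should appear here as numeric character references.
        string latin1Xml = WriteAsLatin1(original);
        Console.WriteLine(latin1Xml);

        // Parse the escaped XML again and compare its text content with the original.
        var reparsed = new XmlDocument();
        reparsed.LoadXml(latin1Xml);
        Console.WriteLine(
            reparsed.DocumentElement.InnerText == original.DocumentElement.InnerText);
    }
}
```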

Starting with the .NET Framework 2.0, we recommend that you use the System.Xml.XmlWriter class instead.

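So replacing it with XmlWriter is a good idea in general.

A sample .Net fiddle shows usage of both writers and asserts that the XML generated by XmlWriter is semantically equivalent to the original XML despite the escaping of characters.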