XmlDocument with Kanji text content is not encoded correctly to ISO-8859-1 using XmlTextWriter
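I have an XmlDocument that includes Kanji in its text content, and I need to write it to a stream using ISO-8859-1 encoding. When I do, none of the Kanji characters are encoded properly; they are replaced with "??" instead.
Here is sample code that demonstrates how the XML is written from the XmlDocument: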
MemoryStream mStream = new MemoryStream();
Encoding enc = Encoding.GetEncoding("ISO-8859-1");
XmlTextWriter writer = new XmlTextWriter(mStream,enc);
doc.WriteTo(writer);
writer.Flush();
mStream.Flush();
mStream.Position = 0;
StreamReader sReader = new StreamReader(mStream, enc);
String formattedXML = sReader.ReadToEnd();
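How can I properly encode Kanji in this specific case?
As noted in the comments, the ? characters appear because the ISO-8859-1 encoding does not support Kanji characters, so it substitutes ? as a fallback character. Encoding fallbacks are discussed in the Documentation Remarks for Encoding: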
Note that the encoding classes allow errors (unsupported characters) to:
- Silently change to a "?" character.
- Use a "best fit" character.
- Change to an application-specific behavior through use of the EncoderFallback and DecoderFallback classes with the U+FFFD Unicode replacement character.
This is the behavior you are seeing.
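To make the fallback behavior concrete, here is a minimal sketch that encodes a string containing a Kanji character as ISO-8859-1 twice: once with the encoding's default fallback, and once with EncoderFallback.ExceptionFallback so that unsupported characters raise an error instead of being silently replaced. The class and method names (FallbackDemo, EncodeWithReplacement, IsEncodable) are made up for illustration and are not part of the question's code.

```csharp
using System;
using System.Text;

public static class FallbackDemo
{
    // Encodes the text as ISO-8859-1 using the encoding's default fallback,
    // then decodes the bytes again so the substitution becomes visible.
    public static string EncodeWithReplacement(string text)
    {
        Encoding latin1 = Encoding.GetEncoding("ISO-8859-1");
        byte[] bytes = latin1.GetBytes(text);
        return latin1.GetString(bytes);
    }

    // Returns true only if every character of the text can be represented
    // in ISO-8859-1. The exception fallback turns silent substitution into
    // an EncoderFallbackException.
    public static bool IsEncodable(string text)
    {
        Encoding strict = Encoding.GetEncoding(
            "ISO-8859-1",
            EncoderFallback.ExceptionFallback,
            DecoderFallback.ExceptionFallback);
        try
        {
            strict.GetBytes(text);
            return true;
        }
        catch (EncoderFallbackException)
        {
            return false;
        }
    }

    public static void Main()
    {
        // \u7551 is the Kanji character from the sample XML in this answer.
        Console.WriteLine(EncodeWithReplacement("\u7551 hatake"));
        Console.WriteLine(IsEncodable("\u7551 hatake"));
        Console.WriteLine(IsEncodable("hatake"));
    }
}
```

A check like IsEncodable can be useful when you would rather fail loudly than write "?" characters into a stream without noticing.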
However, even though ISO-8859-1 does not support Kanji characters, you can get a much better result by switching to the newer XmlWriter returned by XmlWriter.Create(Stream, XmlWriterSettings) and setting your encoding on XmlWriterSettings.Encoding, like so:
MemoryStream mStream = new MemoryStream();
var enc = Encoding.GetEncoding("ISO-8859-1");
var settings = new XmlWriterSettings
{
Encoding = enc,
CloseOutput = false,
// Remove to enable the XML declaration if you want it. XmlTextWriter doesn't include it automatically.
OmitXmlDeclaration = true,
};
using (var writer = XmlWriter.Create(mStream, settings))
{
doc.WriteTo(writer);
}
mStream.Position = 0;
var sReader = new StreamReader(mStream, enc);
var formattedXML = sReader.ReadToEnd();
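By setting the Encoding property of XmlWriterSettings, the XML writer is made aware whenever a character is not supported by the current encoding, and it automatically replaces that character with an XML character entity reference rather than some hard-coded fallback.
For example, say your XML looks like the following: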
<Root>
<string>畑 はたけ hatake "field of crops"</string>
</Root>
Then your code will output the following, mapping every Kanji character to the single fallback character:
<Root><string>? ??? hatake "field of crops"</string></Root>
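Whereas the new version will output:
<Root><string>&#x7551; &#x306F;&#x305F;&#x3051; hatake "field of crops"</string></Root>
Notice that the Kanji characters have been replaced with character entities such as &#x7551;? All conformant XML parsers will recognize and reconstruct those characters, so no information is lost even though your preferred encoding does not support Kanji.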
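The round trip described above can be checked with a short sketch: write a document to ISO-8859-1 bytes with XmlWriter.Create, parse those bytes back, and compare the text content with the original. The class and method names (RoundTripDemo, WriteAsLatin1, ReadRootText) are hypothetical, and the sample XML is a shortened version of the one in this answer.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

public static class RoundTripDemo
{
    // Serializes the document to ISO-8859-1 bytes. Characters the encoding
    // cannot represent are written as character entity references.
    public static byte[] WriteAsLatin1(XmlDocument doc)
    {
        var settings = new XmlWriterSettings
        {
            Encoding = Encoding.GetEncoding("ISO-8859-1"),
            OmitXmlDeclaration = true,
        };
        using (var stream = new MemoryStream())
        {
            // Disposing the writer flushes it; the stream stays open
            // because CloseOutput defaults to false.
            using (var writer = XmlWriter.Create(stream, settings))
            {
                doc.WriteTo(writer);
            }
            return stream.ToArray();
        }
    }

    // Parses the bytes back into a document and returns the text content
    // of the root element.
    public static string ReadRootText(byte[] bytes)
    {
        var doc = new XmlDocument();
        using (var reader = new StreamReader(
            new MemoryStream(bytes), Encoding.GetEncoding("ISO-8859-1")))
        {
            doc.Load(reader);
        }
        return doc.DocumentElement.InnerText;
    }

    public static void Main()
    {
        var original = new XmlDocument();
        original.LoadXml(
            "<Root><string>\u7551 \u306F\u305F\u3051 hatake</string></Root>");

        byte[] bytes = WriteAsLatin1(original);
        string roundTripped = ReadRootText(bytes);

        // The serialized form contains only ASCII because of the entities.
        Console.WriteLine(Encoding.ASCII.GetString(bytes));
        Console.WriteLine(roundTripped == original.DocumentElement.InnerText);
    }
}
```

Comparing InnerText is a deliberately simple equivalence check; a stricter comparison could walk both node trees.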
Finally, as an aside, the documentation for XmlTextWriter states:
Starting with the .NET Framework 2.0, we recommend that you use the System.Xml.XmlWriter class instead.
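So replacing it with an XmlWriter is generally a good idea.
A sample .Net fiddle demonstrates the usage of both writers and asserts that the XML generated by XmlWriter is semantically equivalent to the original XML, despite the escaped characters.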