How to decode a webpage into UTF8 regardless of its encoding

Question

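I'm using WebClient.DownloadFile in C# to download web pages. They download fine, but the encoding is often unclear (no BOM is present). The Mozilla Universal Charset Detector (port 1, port 2) provides a system to detect the encoding of plaintext files using heuristics, and this provides a simpler approach that recognizes fewer encodings.

So first, are HTML web pages even encoded in strange encodings like Shift-JIS and EUC-KR? If not, a faster detection method could be used that assumes the content is either ASCII/ANSI or UTF8.

Second, even after detecting the encoding, how does one go about decoding the byte[] of the file into an appropriate UTF8 string? After doing some string processing, can I just save the file back to disk with a UTF8 BOM? Or do I have to add an extra tag such as <meta charset="utf-8".."> to the HTML file?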

Answer 1

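The charset of a web page should be described by the content-type response header, especially the charset attribute, but often it is not. Sometimes there is a <meta http-equiv="content-type" /> instead, but if that is missing too, all bets are off and you need to detect the actual encoding.

It seems you are heading in the right direction.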
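As a rough sketch of the first step, the charset can be pulled out of a Content-Type header value with System.Net.Mime.ContentType. The class name CharsetFromHeader and the method TryGetCharset are made up for illustration; a null result means the header gave no answer and you have to fall back to the meta tag or to detection.

```csharp
using System;
using System.Net.Mime;

static class CharsetFromHeader
{
    // Hypothetical helper: returns the charset named in a Content-Type
    // header value, or null when the header is missing, malformed,
    // or carries no charset parameter.
    public static string TryGetCharset(string contentTypeHeader)
    {
        if (string.IsNullOrWhiteSpace(contentTypeHeader))
            return null;

        try
        {
            // ContentType parses "type/subtype; param=value" strings.
            return new ContentType(contentTypeHeader).CharSet;
        }
        catch (FormatException)
        {
            // The server sent something that is not a valid media type.
            return null;
        }
    }
}
```

With WebClient, the header value should be available after a download as client.ResponseHeaders[HttpResponseHeader.ContentType], so something like CharsetFromHeader.TryGetCharset(client.ResponseHeaders[HttpResponseHeader.ContentType]) would give you the declared charset, if any.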

are HTML web pages even encoded in strange encodings

That depends on the pages you request.

how does one go about decoding the byte[] of the file into an appropriate UTF8 string?

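You don't want to. Strings in .NET are all encoded as UTF-16 internally, and all utility functions work with that format.

So string content = Encoding.GetEncoding(yourDetectedEncoding).GetString(contentBytes) will do.

You can then write this UTF-16 encoded content string back to a UTF-8 encoded file, with a BOM: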
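Spelled out a little, that decoding step might look like the sketch below. DecodeSketch and Decode are illustrative names, and the encoding name is assumed to come from whatever detector you use.

```csharp
using System;
using System.Text;

static class DecodeSketch
{
    // Hypothetical helper: turns raw page bytes into a .NET string
    // using the encoding name reported by the detection step.
    public static string Decode(byte[] contentBytes, string detectedEncodingName)
    {
        // GetEncoding throws ArgumentException for names it does not recognize.
        // Assumption: on .NET Core / .NET 5+, legacy code pages such as
        // shift_jis or euc-kr need this call once at startup:
        //   Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        Encoding source = Encoding.GetEncoding(detectedEncodingName);
        return source.GetString(contentBytes);
    }
}
```

For example, DecodeSketch.Decode(new byte[] { 0x63, 0x61, 0x66, 0xE9 }, "iso-8859-1") yields the string "café", because 0xE9 is é in Latin-1.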

File.WriteAllText(path, content, Encoding.UTF8);
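If you would rather not have the BOM, a UTF8Encoding constructed with encoderShouldEmitUTF8Identifier set to false can be passed instead. The following is only a sketch; SaveSketch and its two methods are illustrative names.

```csharp
using System.IO;
using System.Text;

static class SaveSketch
{
    // Encoding.UTF8 has a preamble, so the file starts with EF BB BF.
    public static void SaveWithBom(string path, string content)
    {
        File.WriteAllText(path, content, Encoding.UTF8);
    }

    // new UTF8Encoding(false) has no preamble, so no BOM is written.
    public static void SaveWithoutBom(string path, string content)
    {
        File.WriteAllText(path, content, new UTF8Encoding(false));
    }
}
```

Either way the bytes of the text itself are the same UTF-8; only the three leading BOM bytes differ.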

c#

encoding

webpage

utf-8

character-encoding