Kanji characters from WebClient html different from actual Kanji in website

Question

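So, I'm trying to grab part of the text from a website called Kanji-A-Day.com, but I've run into a problem.

You see, I'm trying to get the daily kanji from the website, and I was able to trim the HTML down to just the part I want, but the characters seem to be different..?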

What it looks like

What it should look like

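What's even stranger is that I copied and pasted the text in the second image directly from the website, so it's not a font problem.

Here is the code I use to get the character: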

public void UpdateDailyKanji() // Called at the initialization of a new main form
{
    string kanji;
    using (WebClient client = new WebClient()) // Grab the string 
        kanji = client.DownloadString("http://www.kanji-a-day.com/level4/index.php"); 
    // Trim the HTML to just the Kanji
    kanji = kanji.Remove(0, kanji.IndexOf(@"<div class=""glyph"">") + 19);
    kanji = kanji.Remove(kanji.IndexOf("</div>")-2);
    kanji = kanji.Trim();
    Text_DailyKanji.Text = kanji; // Set the Kanji
}

Does anyone know what's going on here? I'm guessing it's some Unicode thing, but I don't know much about it.

Thanks in advance.

Answer 1

The page you're trying to download as a string is encoded with charset=EUC-JP, also known as Japanese (EUC) (code page 51932). This is clearly set in the page headers.
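To make the mismatch concrete, here is a minimal sketch (not part of the original answer) that decodes the same bytes twice: once as EUC-JP and once as ISO-8859-1. The four bytes are the EUC-JP encoding of 日本, picked only as an illustration, and `EncodingDemo` is a hypothetical name. The sketch assumes .NET Core / .NET 5+, where the EUC-JP encoding has to be registered through the code pages provider; on .NET Framework that line should not be needed.

```csharp
using System;
using System.Text;

public static class EncodingDemo
{
    // Decodes raw bytes with the encoding registered under the given name.
    public static string Decode(byte[] raw, string encodingName)
    {
        return Encoding.GetEncoding(encodingName).GetString(raw);
    }

    public static void Main()
    {
        // Assumption: running on .NET Core / .NET 5+, where legacy code pages
        // such as EUC-JP are only available after this registration.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        byte[] raw = { 0xC6, 0xFC, 0xCB, 0xDC }; // "日本" encoded as EUC-JP

        Console.WriteLine(Decode(raw, "euc-jp"));     // decoded with the page's charset
        Console.WriteLine(Decode(raw, "iso-8859-1")); // same bytes, wrong decoder
    }
}
```

The first call yields 日本 and the second yields ÆüËÜ: the bytes are identical, only the decoder differs, which is the same kind of garbling seen in the question.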

Why does WebClient.DownloadString return a string decoded with the wrong encoding?

The MSDN documentation states:

This method retrieves the specified resource. After it downloads the resource, the method uses the encoding specified in the Encoding property to convert the resource to a String.

Therefore, you have to know in advance which encoding will be used and specify it by setting the WebClient.Encoding property.
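When the page's charset is known up front, as it is here, the simplest fix is probably to set that property before downloading. The following is a minimal sketch under that assumption; `KanjiDownloader`, `CreateEucJpClient` and `DownloadPage` are hypothetical names, and the provider registration is only assumed to be needed on .NET Core / .NET 5+.

```csharp
using System.Net;
using System.Text;

public static class KanjiDownloader
{
    // Builds a WebClient that decodes downloaded bytes as EUC-JP.
    public static WebClient CreateEucJpClient()
    {
        // Assumption: .NET Core / .NET 5+. On .NET Framework this line
        // should be unnecessary and can be dropped.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        var client = new WebClient();
        client.Encoding = Encoding.GetEncoding("euc-jp");
        return client;
    }

    // Downloads the page at the given URL and returns it decoded as EUC-JP.
    public static string DownloadPage(string url)
    {
        using (WebClient client = CreateEucJpClient())
        {
            return client.DownloadString(url);
        }
    }
}
```

Calling `KanjiDownloader.DownloadPage("http://www.kanji-a-day.com/level4/index.php")` would then replace the `DownloadString` call in the question. The trade-off is that the charset is hard-coded, so the code breaks silently if the site ever changes its encoding.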

To verify this, check the .NET Reference Source for the WebClient.DownloadString method:

try {
    WebRequest request;
    byte [] data = DownloadDataInternal(address, out request);
    string stringData = GetStringUsingEncoding(request, data);
    if(Logging.On)Logging.Exit(Logging.Web, this, "DownloadString", stringData);
    return stringData;
} finally {
    CompleteWebClientState();
}

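The encoding is taken from the Request settings, not the Response settings.
As a result, the downloaded bytes are decoded using the default code page.

What you can do now:

Download the page twice, the first time to check whether the WebClient encoding and the HTML page encoding differ.
Re-encode the string with the correct encoding, the one set in the underlying WebResponse.
Don't use WebClient; use HttpClient or WebRequest directly. Or, if you like this tool, create a custom WebClient class that handles the WebRequest/WebResponse in a more direct way.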
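The last option, a custom WebClient class, could be sketched roughly as follows. This is not code from the original answer: `CharsetAwareWebClient` and `ResolveEncoding` are hypothetical names, and the approach assumes that DownloadString reads the Encoding property only after the response has been received, which is what the reference source quoted above suggests. The provider registration is again an assumption for .NET Core / .NET 5+.

```csharp
using System;
using System.Net;
using System.Text;

// Hypothetical subclass: adopts the charset announced by the response
// before WebClient converts the downloaded bytes to a string.
public class CharsetAwareWebClient : WebClient
{
    static CharsetAwareWebClient()
    {
        // Assumption: .NET Core / .NET 5+; drop this on .NET Framework.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
    }

    protected override WebResponse GetWebResponse(WebRequest request)
    {
        WebResponse response = base.GetWebResponse(request);
        var http = response as HttpWebResponse;
        if (http != null)
        {
            // Keep the current Encoding when the charset is missing or unknown.
            Encoding = ResolveEncoding(http.CharacterSet, Encoding);
        }
        return response;
    }

    // Maps a charset name to an Encoding, returning the fallback when the
    // name is empty or not recognized.
    public static Encoding ResolveEncoding(string charset, Encoding fallback)
    {
        if (string.IsNullOrWhiteSpace(charset)) return fallback;
        try
        {
            return Encoding.GetEncoding(charset.Trim().Trim('"'));
        }
        catch (ArgumentException)
        {
            return fallback;
        }
    }
}
```

With this class, `new CharsetAwareWebClient().DownloadString(uri)` needs no re-encoding step afterwards, at the cost of relying on the order in which WebClient performs its internal calls.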

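Here is a method that performs the re-encoding task:
The string returned by WebClient is converted to a byte array and passed to a MemoryStream, then re-encoded using a StreamReader with the encoding retrieved from the charset of the Content-Type response header.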
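Stripped of the networking, the re-encoding step comes down to two calls. The sketch below isolates it; `Reencoder` and `Reencode` are hypothetical names, and the provider registration mentioned in the usage note is an assumption for .NET Core / .NET 5+.

```csharp
using System;
using System.Text;

public static class Reencoder
{
    // Undoes a wrong decode: converts the string back into the bytes it was
    // built from, then decodes those bytes with the page's actual encoding.
    public static string Reencode(string garbled, Encoding usedByMistake, Encoding actual)
    {
        byte[] original = usedByMistake.GetBytes(garbled);
        return actual.GetString(original);
    }
}
```

For example, after `Encoding.RegisterProvider(CodePagesEncodingProvider.Instance)`, the call `Reencoder.Reencode("ÆüËÜ", Encoding.GetEncoding("iso-8859-1"), Encoding.GetEncoding("euc-jp"))` gives back 日本. The round trip is only safe when the mistaken encoding maps every byte to a distinct character and back, as ISO-8859-1 does; with other code pages some bytes may not survive, in which case downloading the raw bytes is the more robust route.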

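Edit:
The method now uses Reflection to get the page Encoding from the underlying HttpWebResponse. This should avoid errors when parsing the original CharacterSet as defined by the remote response.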

using System.IO;
using System.Net;
using System.Reflection;
using System.Text;

public string WebClient_DownLoadString(Uri uri)
{
    using (var client = new WebClient())
    {
        // If Windows 7 - Windows Server 2008 R2
        ServicePointManager.SecurityProtocol = SecurityProtocolType.Tls12;

        client.CachePolicy = new System.Net.Cache.RequestCachePolicy(System.Net.Cache.RequestCacheLevel.BypassCache);
        client.Headers.Add(HttpRequestHeader.Accept, "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8");
        client.Headers.Add(HttpRequestHeader.AcceptLanguage, "en-US,en;q=0.8");
        client.Headers.Add(HttpRequestHeader.KeepAlive, "keep-alive");

        string result = client.DownloadString(uri);

        var flags = BindingFlags.Instance | BindingFlags.NonPublic;
        using (var response = (HttpWebResponse)client.GetType().GetField("m_WebResponse", flags).GetValue(client))
        {
            // Encoding declared by the server in the Content-Type charset
            var pageEncoding = Encoding.GetEncoding(response.CharacterSet);
            byte[] bytes = client.Encoding.GetBytes(result);
            using (var ms = new MemoryStream(bytes, 0, bytes.Length))
            using (var reader = new StreamReader(ms, pageEncoding))
            {
                ms.Position = 0;
                return reader.ReadToEnd();
            }
        }
    }
}

Your code should now get the Japanese characters in their correct form.

Uri uri = new Uri("http://www.kanji-a-day.com/level4/index.php", UriKind.Absolute);
string kanji = WebClient_DownLoadString(uri);

kanji = kanji.Remove(0, kanji.IndexOf("<div class=\"glyph\">") + 19);
kanji = kanji.Remove(kanji.IndexOf("</div>")-2);
kanji = kanji.Trim();

Text_DailyKanji.Text = kanji;
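As a side note, the trimming itself depends on a hard-coded offset (19) and on dropping two characters before the closing tag. A regular expression is one less fragile way to pull out the same text. This is only a sketch: `GlyphExtractor` is a hypothetical name, and it assumes the two characters the original code removes before `</div>` are whitespace.

```csharp
using System.Text.RegularExpressions;

public static class GlyphExtractor
{
    // Returns the trimmed text inside the first <div class="glyph"> element,
    // or null when the page contains no such element.
    public static string ExtractGlyph(string html)
    {
        Match match = Regex.Match(
            html,
            "<div class=\"glyph\">(.*?)</div>",
            RegexOptions.Singleline); // let '.' span line breaks inside the div
        return match.Success ? match.Groups[1].Value.Trim() : null;
    }
}
```

With this helper the three `Remove`/`Trim` lines become `Text_DailyKanji.Text = GlyphExtractor.ExtractGlyph(kanji);`, with a null check if the page layout might change.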
