Extract plain HTML from a website

Question

I am trying to fetch the content of a website with the following code:

HttpClient httpClient = new HttpClient();
string htmlresult = "";

var response = await httpClient.GetAsync(url);

if (response.IsSuccessStatusCode)
{
    htmlresult = await response.Content.ReadAsStringAsync();
}

return htmlresult;

It gives me the correct HTML for every site except https://www.yahoo.com, which returns what looks like an encrypted string instead of plain HTML, as shown below.

   ‹       Ä½ç–ãF¶.øÿ<»Ž4Kj“ð¦ÔÒ½÷ž·îÊO0$ Úž~÷   4@D™U:ëNgK"bÛÄïÿõr¯4^ô

How can I get plain HTML out of this seemingly encrypted text?

Answer 1

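Yahoo supports the gzip, deflate, and br content encodings (Accept-Encoding: gzip, deflate, br), so the content in your case is gzip-compressed rather than encrypted. A quick fix for your code is to enable automatic decompression: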

private async Task<String> GetUrl(string url)
{
    // Let the handler advertise gzip/deflate support and decompress responses transparently.
    HttpClientHandler handler = new HttpClientHandler()
    {
        AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate
    };

    HttpClient httpClient = new HttpClient(handler);

    string htmlresult = "";

    var response = await httpClient.GetAsync(url);

    if (response.IsSuccessStatusCode)
    {
        htmlresult = await response.Content.ReadAsStringAsync();
    }

    return htmlresult;
}
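
To see why the response looked "encrypted", here is a minimal, network-free sketch of what is happening. It is only an illustration: the class name GzipDemo and its helper methods are made up for this example and are not part of HttpClient. It compresses a small HTML string the way a server using gzip would, checks for the gzip magic bytes 0x1F 0x8B, and decompresses the bytes by hand with GZipStream. The ‹ character near the start of the sample in the question is consistent with byte 0x8B displayed as Windows-1252.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;

// Hypothetical helper class, used only for this illustration.
public static class GzipDemo
{
    // Compresses text roughly the way a server sending gzip content would.
    public static byte[] Compress(string text)
    {
        using (var output = new MemoryStream())
        {
            using (var gzip = new GZipStream(output, CompressionMode.Compress))
            {
                byte[] raw = Encoding.UTF8.GetBytes(text);
                gzip.Write(raw, 0, raw.Length);
            }
            // ToArray still works after the GZipStream has closed the MemoryStream.
            return output.ToArray();
        }
    }

    // Reverses Compress: this is the work AutomaticDecompression does for you.
    public static string Decompress(byte[] compressed)
    {
        using (var input = new MemoryStream(compressed))
        using (var gzip = new GZipStream(input, CompressionMode.Decompress))
        using (var reader = new StreamReader(gzip, Encoding.UTF8))
        {
            return reader.ReadToEnd();
        }
    }

    // Every gzip stream starts with the two magic bytes 0x1F 0x8B.
    public static bool LooksLikeGzip(byte[] data)
    {
        return data.Length >= 2 && data[0] == 0x1F && data[1] == 0x8B;
    }

    public static void Main()
    {
        byte[] body = Compress("<html><body>Hello</body></html>");
        Console.WriteLine(LooksLikeGzip(body)); // prints True
        Console.WriteLine(Decompress(body));    // prints <html><body>Hello</body></html>
    }
}
```

In real code you would normally not decompress by hand. Setting AutomaticDecompression on the handler, as in the answer above, is simpler. The sketch is only meant to show that the garbled text is compressed bytes read as if they were characters.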

Tags: html, c#, yahoo, https, httpclient