Two methods for loading HTML from a URL?

Question

To load HTML from a URL, I use the following method:

public HtmlDocument DownloadSource(string url)
{
    try
    {
        HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
        doc.LoadHtml(DownloadString(url));
        return doc;
    }
    catch (Exception e)
    {
        if (Task.Error == null)
            Task.Error = e;
        Task.Status = TaskStatuses.Error;
        Done = true;
        return null;
    }
}

But today the code above suddenly stopped working. I found another approach, and it works fine:

HtmlWeb web = new HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = web.Load(url.ToString());

Now I just want to know the difference between these two methods.

Answer 1

It looks like the User-Agent header is now mandatory for your site.

HtmlAgilityPack itself works fine, but you should change your DownloadString(url) method. If you inspect the request with Fiddler, you will see that it returns 403 Forbidden.
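If Fiddler is not at hand, the same status code can probably be observed from code. The following is only a sketch: `StatusProbe`, `Probe`, and `StatusOf` are hypothetical names that do not appear in the question, and the sketch assumes the server answers a header-less request with an HTTP error that `WebClient` surfaces as a `WebException`.

```csharp
using System;
using System.Net;

// Hypothetical helper, not part of the original code.
static class StatusProbe
{
    // Issues a plain WebClient GET (no User-Agent header is sent by default).
    // Any successful download is reported as OK; an HTTP-level failure returns
    // the server's status code; other failures (DNS, timeout) return null.
    public static HttpStatusCode? Probe(string url)
    {
        using (var client = new WebClient())
        {
            try
            {
                client.DownloadString(url);
                return HttpStatusCode.OK;
            }
            catch (WebException ex)
            {
                return StatusOf(ex);
            }
        }
    }

    // Extracts the HTTP status code from a WebException, if there is one.
    public static HttpStatusCode? StatusOf(WebException ex)
    {
        var response = ex.Response as HttpWebResponse;
        return response?.StatusCode;
    }
}
```

Calling `StatusProbe.Probe(url)` on the failing URL and printing the result should show `Forbidden` if the diagnosis in this answer applies to your case.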

The solution is to add any User-Agent header to the request:

using HtmlAgilityPack;
using System;
using System.Net;

class Program
{
    static void Main()
    {
        var doc = DownloadSource("https://videohive.net/item/inspired-slideshow/21544630");
        Console.ReadKey();
    }

    public static HtmlDocument DownloadSource(string url)
    {
        try
        {
            HtmlDocument doc = new HtmlDocument();
            doc.LoadHtml(DownloadString(url));
            return doc;
        }
        catch (Exception e)
        {
            // Report the failure (for example a 403 response) instead of silently swallowing it.
            Console.Error.WriteLine(e.Message);
        }
        return null;
    }

    static string DownloadString(string url)
    {
        // WebClient sends no User-Agent header by default, so set one explicitly.
        using (WebClient client = new WebClient())
        {
            client.Headers.Add("User-Agent", "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:x.x.x) Gecko/20041107 Firefox/x.x");
            return client.DownloadString(url);
        }
    }
}
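As for the difference between the two methods: a likely explanation, consistent with the 403 above, is that HtmlWeb sends a User-Agent header of its own while a bare WebClient sends none. Recent HtmlAgilityPack versions expose this through the `HtmlWeb.UserAgent` property, so the second approach can also set the header explicitly. This is a minimal sketch under that assumption; `HtmlWebLoader`, `CreateWeb`, and `Load` are hypothetical names introduced here for illustration.

```csharp
using HtmlAgilityPack;

// Hypothetical helper, not part of the original code.
static class HtmlWebLoader
{
    // Builds an HtmlWeb instance that sends the given User-Agent value.
    public static HtmlWeb CreateWeb(string userAgent)
    {
        return new HtmlWeb { UserAgent = userAgent };
    }

    // Downloads and parses the page in one step, as in the question's second snippet.
    public static HtmlDocument Load(string url, string userAgent)
    {
        return CreateWeb(userAgent).Load(url);
    }
}
```

With this in place, `HtmlWebLoader.Load(url, "Mozilla/5.0 ...")` would replace both the DownloadString helper and the LoadHtml call, at the cost of less control over the underlying request.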


html-agility-pack