HtmlAgilityPack does not get XPath results in C#

Question

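This code used to work: it could select nodes by XPath from the website. Today, when I debugged it, I saw that it no longer gets any data from the HTML of webtruyen.com. I checked the site's robots.txt, but nothing in it looked suspicious. I also tried adding a proxy to get the data, but the returned data is null. I don't know how to run XPath queries against webtruyen.com. Can anyone help? I want to know how to read data from http://webtruyen.com.

My code: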

string url = "http://webtruyen.com";
var web = new HtmlWeb();
var doc = web.Load(url);
String temps = "";
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//a"))
{
     temps  = node.InnerHtml;
}

When I debug, I get:

InnerHtml 'doc.DocumentNode.InnerHtml' threw an exception of type 'System.NullReferenceException' string {System.NullReferenceException}

My code with the proxy attempt (it sets a custom User-Agent):

string url = "http://webtruyen.com";
var web = new HtmlWeb();
web.UserAgent = "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) Speedy Spider (http://www.entireweb.com/about/search_tech/speedy_spider/)";
var doc = web.Load(url);
String temps = "";
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//a"))
{
     temps  = node.InnerHtml;
}

Answer 1

I got the same error when using HtmlWeb.Load(), but your problem is easy to solve with HttpWebRequest (TL;DR: see step 3 for working code).

Step 1) Use the following code:

        HttpWebRequest hwr = (HttpWebRequest)WebRequest.Create("http://webtruyen.com");
        using (Stream s = hwr.GetResponse().GetResponseStream())
        { }

You will see that you actually get a 403 Forbidden error (thrown as a WebException).
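To confirm the status code without stepping through the debugger, the exception's response can be inspected directly. The following is a minimal sketch; `StatusProbe`, `StatusOf` and `Probe` are hypothetical names introduced only for illustration, and the sketch assumes the failure is an HTTP-level error (for network-level failures such as DNS errors there is no response, so it returns null).

```csharp
using System.Net;

static class StatusProbe
{
    // Returns the HTTP status carried by a WebException,
    // or null when the failure happened before any HTTP response arrived.
    public static HttpStatusCode? StatusOf(WebException ex)
    {
        var response = ex.Response as HttpWebResponse;
        return response?.StatusCode;
    }

    // Issues a GET request and reports the status code,
    // including error statuses such as 403 that GetResponse() throws on.
    public static HttpStatusCode? Probe(string url)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        try
        {
            using (var response = (HttpWebResponse)request.GetResponse())
            {
                return response.StatusCode;
            }
        }
        catch (WebException ex)
        {
            // Dispose the error response (if any) after reading its status.
            using (ex.Response)
            {
                return StatusOf(ex);
            }
        }
    }
}
```

Calling `StatusProbe.Probe("http://webtruyen.com")` would be expected to report `Forbidden` for as long as the site behaves as described here.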

Step 2) Catch the exception and load the body of the error response:

        HttpWebRequest hwr = (HttpWebRequest)WebRequest.Create("http://webtruyen.com");
        HtmlDocument doc = new HtmlDocument();
        try
        {
            using (Stream s = hwr.GetResponse().GetResponseStream())
            { }
        }
        catch (WebException wx)
        {
            doc.LoadHtml(new StreamReader(wx.Response.GetResponseStream()).ReadToEnd());
        }

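In doc.DocumentNode.OuterHtml you will see the HTML of the Forbidden page, along with JavaScript that sets a cookie in your browser and then refreshes the page.

Step 3) So, to load the page outside of a real browser, you have to set that cookie manually and request the page again. That is: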

        string cookie = string.Empty;
        HttpWebRequest hwr = (HttpWebRequest)WebRequest.Create("http://webtruyen.com");
        try
        {
            using (Stream s = hwr.GetResponse().GetResponseStream())
            { }
        }
        catch (WebException wx)
        {
            cookie = Regex.Match(new StreamReader(wx.Response.GetResponseStream()).ReadToEnd(), "document.cookie = '(.*?)';").Groups[1].Value;
        }
        hwr = (HttpWebRequest)WebRequest.Create("http://webtruyen.com");
        hwr.Headers.Add("Cookie", cookie);
        HtmlDocument doc = new HtmlDocument();
        using (Stream s = hwr.GetResponse().GetResponseStream())
        using (StreamReader sr = new StreamReader(s))
        {
            doc.LoadHtml(sr.ReadToEnd());
        }
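The regular expression above assumes the challenge page assigns the cookie with exactly `document.cookie = '...';`. A slightly more defensive version of that extraction is sketched below. `CookieChallenge`, `ExtractCookie` and `NameValueOnly` are hypothetical helper names, and the sample script in the usage note is made up for illustration; the real page may format its script differently.

```csharp
using System.Text.RegularExpressions;

static class CookieChallenge
{
    // Pulls the string assigned to document.cookie out of the challenge page.
    // The dot is escaped and whitespace around '=' is optional.
    // Returns null when no assignment is found.
    public static string ExtractCookie(string challengeHtml)
    {
        Match m = Regex.Match(challengeHtml ?? string.Empty,
                              @"document\.cookie\s*=\s*'(.*?)';");
        return m.Success ? m.Groups[1].Value : null;
    }

    // A Cookie request header carries only name=value pairs, so attributes
    // such as "path=/" or "expires=..." after the first ';' are dropped.
    public static string NameValueOnly(string cookieAssignment)
    {
        if (string.IsNullOrEmpty(cookieAssignment)) return cookieAssignment;
        int semicolon = cookieAssignment.IndexOf(';');
        return (semicolon < 0 ? cookieAssignment
                              : cookieAssignment.Substring(0, semicolon)).Trim();
    }
}
```

For a script such as `document.cookie = 'token=abc123; path=/';`, `ExtractCookie` yields `token=abc123; path=/` and `NameValueOnly` reduces it to `token=abc123`, which is the form to pass to `hwr.Headers.Add("Cookie", ...)`.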

And you get the page :)

The moral of the story: if your browser can do it, so can you.
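With the document loaded, the loop from the question can run against it. One detail worth guarding against: `SelectNodes` returns null, not an empty collection, when the XPath matches nothing, so a `foreach` over its result throws a NullReferenceException on a page without matches. A minimal sketch follows; `LinkReader` and `ReadLinkHtml` are hypothetical names used only for illustration.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

static class LinkReader
{
    // Collects the inner HTML of every <a> element in the document.
    // Returns an empty list when the document contains no links.
    public static List<string> ReadLinkHtml(HtmlDocument doc)
    {
        var results = new List<string>();
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//a");
        if (nodes == null)
        {
            return results; // no match: SelectNodes gives null, not an empty collection
        }
        foreach (HtmlNode node in nodes)
        {
            results.Add(node.InnerHtml);
        }
        return results;
    }
}
```

Unlike the original loop, which overwrites `temps` on each iteration and keeps only the last link, this version keeps every match.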

c#

xpath

html-agility-pack