Get Proper XPath for SelectNodes

Question

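I've just started using HtmlAgilityPack to scrape some text from websites. After some experimenting, I've noticed that for some sites it is much easier than for others to get a correct XPath to use with the SelectNodes method. I'm sure I'm doing something wrong, but I can't figure out what.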

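For example, when exploring the DOM in Google Chrome I can copy the XPath: //*[@id="page"]/span/table[7]/tbody/tr[1]/td/span[2]/a and then I do something like this:

    var search = doc.DocumentNode.SelectNodes("//*[@id=\"page\"]//span//table//tr//td//span//a");

When I use search in a foreach loop I get a null reference error, and the debugger does indeed show that search is null. So I assume the XPath is wrong (or I'm doing something else completely wrong). My question is: how do I get a correct XPath for HtmlAgilityPack to find these nodes?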

Answer 1

Following up on the request in your last comment: the HTML is only fully rendered once follow-up HTTP GET requests have returned.

Several JavaScript calls insert blocks of HTML into the document.
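This would also account for the null result in the question: HtmlAgilityPack parses the markup it is given and does not run scripts, and by default SelectNodes returns null rather than an empty collection when nothing matches. A minimal sketch, using a made-up page string (the markup, class name, and method name below are illustrative assumptions, not the real site):

```csharp
using System;
using HtmlAgilityPack;

static class StaticHtmlDemo
{
    // Hypothetical stand-in for the downloaded page: the container is empty
    // because a browser would only fill it in later by running the script.
    public const string Html =
        "<html><body><div id=\"page\"></div>" +
        "<script>loadCompanyProfileData('ContactInfo');</script>" +
        "</body></html>";

    // Returns the number of nodes matching the XPath, treating null as zero.
    public static int CountMatches(string html, string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(xpath);
        return nodes == null ? 0 : nodes.Count;
    }

    static void Main()
    {
        // No anchors exist under #page, because the script was never executed.
        Console.WriteLine(CountMatches(Html, "//*[@id='page']//a"));
        // The empty container itself is still present in the static markup.
        Console.WriteLine(CountMatches(Html, "//*[@id='page']"));
    }
}
```

Checking the result for null before the foreach loop avoids the null reference error, although it does not make the missing nodes appear.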

The one you need is loadCompanyProfileData('ContactInfo'), which generates an HTTP GET request that looks like this:

http://financials.morningstar.com/cmpind/company-profile/component.action?component=ContactInfo&t=XNAS:AAPL&region=usa&culture=en-US&cur=&_=1465809033745

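This returns the email, which you can extract with code like the following:

        // Needs: using System.Linq; using HtmlAgilityPack; using ScrapySharp.Extensions;
        // CssSelect is not part of HtmlAgilityPack itself; it is assumed here to be
        // the extension method from the ScrapySharp package.
        HtmlWeb w = new HtmlWeb();
        var doc = w.Load("http://financials.morningstar.com/cmpind/company-profile/component.action?component=ContactInfo&t=XNAS:AAPL&region=usa&culture=en-US&cur=&_=1465809033745");

        // Keep only anchors whose href is a mailto: link, then strip the scheme.
        var emails = doc.DocumentNode.CssSelect("a")
            .Where(a => a.GetAttributeValue("href", string.Empty).StartsWith("mailto:"))
            .Select(a => a.GetAttributeValue("href", string.Empty).Replace("mailto:", string.Empty));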

emails ends up containing 1 element: investor_relations@apple.com.
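Since the question asks about SelectNodes specifically, the same filter can probably be written as a plain XPath expression with no extra package. The sketch below is only an illustration: the sample markup, the address someone@example.com, and the MailtoExtractor name are made up, and the real response may be structured differently.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

static class MailtoExtractor
{
    // Returns the address part of every mailto: link found in the markup.
    public static List<string> ExtractEmails(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // starts-with() is a standard XPath 1.0 function.
        var anchors = doc.DocumentNode.SelectNodes("//a[starts-with(@href, 'mailto:')]");
        if (anchors == null)
        {
            // SelectNodes yields null by default when nothing matches.
            return new List<string>();
        }

        return anchors
            .Select(a => a.GetAttributeValue("href", string.Empty).Substring("mailto:".Length))
            .ToList();
    }

    static void Main()
    {
        // Hypothetical fragment standing in for the ContactInfo response.
        const string sample =
            "<div><a href=\"mailto:someone@example.com\">Email</a>" +
            "<a href=\"/about\">About</a></div>";

        foreach (string email in ExtractEmails(sample))
        {
            Console.WriteLine(email);
        }
    }
}
```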

Your problem is working out what the "cur" parameter used by the loadCompanyProfileData JavaScript function should be for each different company.
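For experimenting with other companies, the sample URL can be rebuilt from its parts. This is a rough sketch under stated assumptions: the ContactInfoUrl helper is hypothetical, "cur" is left empty exactly as in the sample request, and the trailing "_" value is assumed (not confirmed) to be a millisecond timestamp used as a cache-buster.

```csharp
using System;

static class ContactInfoUrl
{
    const string BaseUrl =
        "http://financials.morningstar.com/cmpind/company-profile/component.action";

    // Builds the request URL for one ticker such as "XNAS:AAPL".
    public static string Build(string ticker, long cacheBuster)
    {
        return BaseUrl
            + "?component=ContactInfo"
            + "&t=" + Uri.EscapeDataString(ticker)   // ':' is emitted as %3A
            + "&region=usa&culture=en-US"
            + "&cur="                                // empty, as in the sample URL
            + "&_=" + cacheBuster;                   // assumed cache-buster value
    }

    static void Main()
    {
        long now = DateTimeOffset.UtcNow.ToUnixTimeMilliseconds();
        Console.WriteLine(Build("XNAS:AAPL", now));
    }
}
```

Escaping the ticker changes ':' to %3A, which a server that decodes query strings should treat the same as the literal colon in the sample.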

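I could not find where or how this "cur" value is generated in the page's code. One alternative is to drive a real browser through an automation tool (such as the Selenium WebDriver port for C#), so that you can execute the JavaScript code and call loadCompanyProfileData('ContactInfo') for each company request.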
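A sketch of what that approach might look like with the Selenium .NET bindings is shown here. It is untested against the real site and rests on several assumptions: the company page URL is not given in this thread, so it is taken from the command line; the loader function is assumed to be callable from the global scope; and the 10-second timeout is arbitrary.

```csharp
using System;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;
using OpenQA.Selenium.Support.UI;

class Program
{
    static void Main(string[] args)
    {
        // The company page URL is not given in this thread, so it is passed in.
        string pageUrl = args[0];

        using (IWebDriver driver = new ChromeDriver())
        {
            driver.Navigate().GoToUrl(pageUrl);

            // Ask the page to run its own loader (assumed to be a global function).
            ((IJavaScriptExecutor)driver)
                .ExecuteScript("loadCompanyProfileData('ContactInfo');");

            // Wait until at least one mailto: link has been inserted into the DOM.
            var wait = new WebDriverWait(driver, TimeSpan.FromSeconds(10));
            wait.Until(d => d.FindElements(By.CssSelector("a[href^='mailto:']")).Count > 0);

            foreach (IWebElement a in driver.FindElements(By.CssSelector("a[href^='mailto:']")))
            {
                Console.WriteLine(a.GetAttribute("href").Replace("mailto:", string.Empty));
            }
        }
    }
}
```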

But I couldn't get that working either; my WebDriver script execution doesn't seem to work.

c#

xpath

html-agility-pack