Selecting div content with HtmlAgilityPack: 'Value cannot be null.'

Question

I am trying to scrape the content of this div:

<div itemprop="articleBody">random, unique content in this div, different each time</div>

My code tries to get the content between the div tags shown above:

 var html = "random url eachtime.com";
 HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
 doc.LoadHtml(html);
 var nodes = doc.DocumentNode.SelectNodes("div[@itemprop=\"articleBody\"]");
 var inntertexts = nodes.Select(node => node.InnerText);
 articletext.Text = inntertexts.ToString();

When I run it against the page to scrape the div's content, I get the following:

exception...Value cannot be null. Parameter name: source

I also tried the full XPath:

/html[1]/body[1]/div[3]/div[2]/div[3]/div[3]/div[5]/div[1]/div[1]/div[1]

I am trying to get the article body from the following link (view its source): http://www.dailymail.co.uk/sciencetech/article-4408856/Samsung-building-flip-phone-TWO-screens.html?ITO=1490&ns_mchannel=rss&ns_campaign=1490

How can I fix this so that I get the div's content?

Answer 1

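The div is probably nested inside other tags. If so, use "//div[@itemprop=\"articleBody\"]" (note the two slashes before div), which matches the div anywhere in the document rather than only directly under the root. When nothing matches, SelectNodes returns null, and calling Select on that null is what throws "Value cannot be null. Parameter name: source". Also note that LoadHtml expects the markup itself, not a URL, so the page has to be downloaded first: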

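using System;
using System.Linq;
using System.Net.Http;
using System.Threading.Tasks;
using HtmlAgilityPack;

class Program
{
    // Run the async entry point and block until it completes.
    static void Main(string[] args) => Task.Run(() => MainAsync(args)).Wait();

    static async Task MainAsync(string[] args)
    {
        // Download the page first; LoadHtml parses markup, it does not fetch URLs.
        var html = await GetResponseFromURI(new Uri("http://www.dailymail.co.uk/sciencetech/article-4408856/Samsung-building-flip-phone-TWO-screens.html?ITO=1490&ns_mchannel=rss&ns_campaign=1490"));
        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(html);
        // "//" searches the whole document, so nesting depth does not matter.
        var nodes = doc.DocumentNode.SelectNodes("//div[@itemprop=\"articleBody\"]");
        // SelectNodes returns null (not an empty collection) when nothing matches.
        if (nodes != null)
        {
            Console.WriteLine(nodes.Select(node => node.InnerText).FirstOrDefault());
        }
        Console.ReadLine();
    }

    // Returns the response body, or an empty string for a non-success status code.
    static async Task<string> GetResponseFromURI(Uri uri)
    {
        var response = "";
        using (var client = new HttpClient())
        {
            HttpResponseMessage result = await client.GetAsync(uri);
            if (result.IsSuccessStatusCode)
                response = await result.Content.ReadAsStringAsync();
        }
        return response;
    }
}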
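
To see the effect of the leading `//` without touching the network, here is a minimal sketch that runs both expressions against a small inline document. The markup, the `XPathDemo` class, and the `CountMatches` helper are made up for illustration; the sketch assumes HtmlAgilityPack's default settings, under which SelectNodes returns null when nothing matches.

```csharp
using System;
using HtmlAgilityPack;

static class XPathDemo
{
    // Hypothetical helper: number of nodes matched by xpath,
    // treating the null that SelectNodes returns on "no match" as zero.
    public static int CountMatches(string html, string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        var nodes = doc.DocumentNode.SelectNodes(xpath);
        return nodes == null ? 0 : nodes.Count;
    }

    static void Main()
    {
        // Illustrative markup: the target div is nested three levels below the root.
        const string html =
            "<html><body><div class='wrap'>" +
            "<div itemprop='articleBody'>Hello</div>" +
            "</div></body></html>";

        // Relative path: only looks at direct children of the document node.
        Console.WriteLine(CountMatches(html, "div[@itemprop='articleBody']"));

        // Descendant path: looks at every div in the document.
        Console.WriteLine(CountMatches(html, "//div[@itemprop='articleBody']"));
    }
}
```

The first call should find nothing because the only direct child of the document node is `html`; the second should find the nested div.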

If no div on the page has itemprop="articleBody" at all, SelectNodes returns null, so you have to keep the null check.
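
One possible way to package the null check is sketched below. The `ArticleExtractor` class and `ExtractArticleBody` method are hypothetical names, not part of HtmlAgilityPack. The sketch also joins the texts with `string.Join`, because calling `ToString()` on the result of `Select`, as the question's code does, yields the enumerable's type name rather than the text.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

static class ArticleExtractor
{
    // Hypothetical helper: returns the trimmed text of every matching div,
    // joined by newlines, or an empty string when no div matches.
    public static string ExtractArticleBody(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var nodes = doc.DocumentNode.SelectNodes("//div[@itemprop='articleBody']");
        if (nodes == null)
        {
            // No match: return a harmless default instead of passing null to LINQ.
            return string.Empty;
        }

        return string.Join(Environment.NewLine, nodes.Select(n => n.InnerText.Trim()));
    }
}

class Demo
{
    static void Main()
    {
        // Illustrative inputs: one page with the div, one without.
        var withBody = "<html><body><div itemprop='articleBody'> Hello </div></body></html>";
        var withoutBody = "<html><body><p>no article here</p></body></html>";

        Console.WriteLine(ArticleExtractor.ExtractArticleBody(withBody));
        Console.WriteLine(ArticleExtractor.ExtractArticleBody(withoutBody).Length);
    }
}
```

In the question's code, the result could then be assigned directly, for example `articletext.Text = ArticleExtractor.ExtractArticleBody(html);`, provided `html` holds the downloaded markup rather than the URL.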

c#

html-agility-pack