Selecting div content with HtmlAgilityPack: 'Value cannot be null.'
I am trying to scrape the content of this div:
<div itemprop="articleBody">random, unique content in this div, different each time</div>
My code tries to get the content between the div tags above:
var html = "random url eachtime.com";
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);
var nodes = doc.DocumentNode.SelectNodes("div[@itemprop=\"articleBody\"]");
var inntertexts = nodes.Select(node => node.InnerText);
articletext.Text = inntertexts.ToString();
When I visit the page to scrape the content of the div, I get the following:
exception...Value cannot be null.
Parameter name: source
I have also tried the full XPath:
/html[1]/body[1]/div[3]/div[2]/div[3]/div[3]/div[5]/div[1]/div[1]/div[1]
I am trying to get the article body from the following link (view source): http://www.dailymail.co.uk/sciencetech/article-4408856/Samsung-building-flip-phone-TWO-screens.html?ITO=1490&ns_mchannel=rss&ns_campaign=1490
How can I fix this so that I get the content of the div?
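The div is probably nested inside other tags. If so, use "//div[@itemprop=\"articleBody\"]" (note the two slashes in front of div), which matches the div at any depth instead of only directly under the document node. Note also that LoadHtml expects HTML markup, not a URL, so the page has to be downloaded first: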
using System;
using System.Linq;
using System.Net.Http;
using System.Threading.Tasks;
using HtmlAgilityPack;

class Program
{
    static void Main(string[] args) => Task.Run(() => MainAsync(args)).Wait();

    static async Task MainAsync(string[] args)
    {
        // Download the page first; LoadHtml parses markup, it does not fetch URLs.
        var html = await GetResponseFromURI(new Uri("http://www.dailymail.co.uk/sciencetech/article-4408856/Samsung-building-flip-phone-TWO-screens.html?ITO=1490&ns_mchannel=rss&ns_campaign=1490"));
        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(html);

        // "//" searches the whole document rather than only direct children.
        var nodes = doc.DocumentNode.SelectNodes("//div[@itemprop=\"articleBody\"]");
        if (nodes != null)
        {
            Console.WriteLine(nodes.Select(node => node.InnerText).FirstOrDefault());
        }
        Console.ReadLine();
    }

    // Returns the response body, or an empty string if the request did not succeed.
    static async Task<string> GetResponseFromURI(Uri uri)
    {
        var response = "";
        using (var client = new HttpClient())
        {
            HttpResponseMessage result = await client.GetAsync(uri);
            if (result.IsSuccessStatusCode)
                response = await result.Content.ReadAsStringAsync();
        }
        return response;
    }
}
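If no div on the page has itemprop="articleBody", SelectNodes returns null, so you have to keep the null check. Calling nodes.Select(...) on that null result is what produces "Value cannot be null. Parameter name: source".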
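To see the difference between the two XPath forms without hitting the network, here is a minimal sketch that runs both against an inline HTML string. The sample markup, the XPathDemo class, and the InnerTexts helper are made up for illustration and are not part of HtmlAgilityPack; only HtmlDocument, LoadHtml, DocumentNode, SelectNodes, and InnerText come from the library.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

public static class XPathDemo
{
    // Hypothetical helper: returns the InnerText of every node matching xpath,
    // or an empty array when SelectNodes finds nothing (it returns null then).
    public static string[] InnerTexts(string html, string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        var nodes = doc.DocumentNode.SelectNodes(xpath);
        return nodes == null
            ? Array.Empty<string>()
            : nodes.Select(n => n.InnerText).ToArray();
    }

    public static void Main()
    {
        // Made-up sample markup standing in for the downloaded page.
        const string html =
            "<html><body><div id=\"wrap\">" +
            "<div itemprop=\"articleBody\">article text</div>" +
            "</div></body></html>";

        // Relative path: only looks at direct children of the document node,
        // so the nested div is not found.
        var relative = InnerTexts(html, "div[@itemprop='articleBody']");
        Console.WriteLine("relative matches: " + relative.Length);

        // Descendant path: finds the div at any depth.
        var anywhere = InnerTexts(html, "//div[@itemprop='articleBody']");
        Console.WriteLine("descendant matches: " + string.Join(", ", anywhere));
    }
}
```

Turning the null result into an empty array inside the helper is one way to keep the null check in a single place; checking for null at each call site, as in the answer above, works just as well.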