XPath: retrieving <a> href, text, and <span>
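I am currently scraping a few websites and retrieving information from them to store in a database for later use. I am using HtmlAgilityPack, and I have done this successfully for several websites now, but for some reason this one is giving me trouble. I am still new to XPath syntax, so I have probably messed something up.
This is the markup from the site that I want to retrieve data from: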
<form ... id="_subcat_ids_">
    <input ....>
    <ul ...>
        <li ....>
            <input .....>
            <a class="facet-selection multiselect-facets "
               .... href="INeedThisHref#1">
                Text I Need //need to retrieve this text between the <a></a>
                <span class="subtle-note">(2)</span> //I need that number from inside the span
            </a>
        </li>
        <li ....>
            <input .....>
            <a class="facet-selection multiselect-facets "
               .... href="INeedThisHref#2">
                Text I Need #2 //need to retrieve this text between the <a></a>
                <span class="subtle-note">(6)</span> //I need that number from inside the span
            </a>
        </li>
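Each <li> represents one item on the page, but I am only interested in what is inside each <a></a>. I want to retrieve the href value from the <a>, then the text between its opening and closing tags, and then the text inside the <span>. I left out the contents of the other tags because they do not help to uniquely identify each item. The class on the <a> is the only thing the items share, and they are all inside the <form> with id="_subcat_ids_".
Here is my code: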
try
{
    string fullUrl = "...";
    HtmlWeb web = new HtmlWeb();
    ServicePointManager.SecurityProtocol = SecurityProtocolType.Ssl3 | SecurityProtocolType.Tls | SecurityProtocolType.Tls11 | SecurityProtocolType.Tls12;
    HtmlDocument html = web.Load(fullUrl);
    foreach (HtmlNode node in html.DocumentNode.SelectNodes("//form[@id='_subcat_ids_']")) //this gets me into the form
    {
        //this should get me into the <a> tags, but it throws 'object reference not set to an instance of an object'
        foreach (HtmlNode node2 in node.SelectNodes(".//a[@class='facet-selection multiselect-facets ']"))
        {
            //get the href
            string tempHref = node2.GetAttributeValue("href", string.Empty);
            //get the text between <a>
            string tempCat = node2.InnerText.Trim();
            //get the text between <span>
            string tempNum = node2.SelectSingleNode(".//span[@class='subtle-note']").InnerText.Trim();
        }
    }
}
catch (Exception ex)
{
    Console.Write("\nError: " + ex.ToString());
}
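The first foreach loop runs without error, but the second one throws "object reference not set to an instance of an object" on the line where that second foreach is declared. As I mentioned, I am still new to this syntax. I have used this approach on another website with great success, but I am having trouble with this one. Any tips would be appreciated.
OK, I figured it out. Here is the code: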
// categoryList and LinkCatList are assumed to be List<string> instances declared elsewhere (not shown)
foreach (HtmlNode node in html.DocumentNode.SelectNodes("//form[@id='_subcat_ids_']"))
{
    //get the categories, store in list
    //"..//a" searches from the form's parent; the text() filter skips whitespace and the text inside <span>
    foreach (HtmlNode node2 in node.SelectNodes("..//a[@class='facet-selection multiselect-facets ']//text()[normalize-space() and not(ancestor::span)]"))
    {
        string tempCat = node2.InnerText.Trim();
        categoryList.Add(tempCat);
        Console.Write("\nCategory: " + tempCat);
    }
    foreach (HtmlNode node3 in node.SelectNodes("..//a[@class='facet-selection multiselect-facets ']"))
    {
        //get href for each category, store in list
        string tempHref = node3.GetAttributeValue("href", string.Empty);
        LinkCatList.Add(tempHref);
        Console.Write("\nhref: " + tempHref);
        //get the number of items from categories, stripping the surrounding parentheses
        string tempNum = node3.SelectSingleNode(".//span[@class='subtle-note']").InnerText.Trim();
        tempNum = tempNum.Replace("(", "").Replace(")", "");
        Console.Write("\nNumber of items: " + tempNum + "\n\n");
    }
}
Works like a charm.
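For anyone hitting the same exception: SelectNodes returns null rather than an empty collection when nothing matches, so the foreach fails as soon as ".//a" finds nothing under the form node. A likely reason it finds nothing is that some HtmlAgilityPack versions flag <form> as an empty element, so the links end up as siblings of the form rather than its children, which would also explain why "..//a" works. The sketch below is one way to handle this in a single loop with null checks. It is not the original poster's code: FacetScraper and Parse are hypothetical names, the contains() match on the class is my own substitution for the exact-string match, and removing the "form" flag is an assumption about the parser version in use.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class FacetScraper // hypothetical helper, for illustration only
{
    public static List<(string Href, string Text, int? Count)> Parse(string markup)
    {
        // Assumption: the installed HtmlAgilityPack version treats <form> as empty.
        // Removing the flag makes the form keep its children; the call has no
        // effect if the flag is absent. Note that this setting is process-wide.
        HtmlNode.ElementsFlags.Remove("form");

        var doc = new HtmlDocument();
        doc.LoadHtml(markup);

        var facets = new List<(string Href, string Text, int? Count)>();

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection links = doc.DocumentNode.SelectNodes(
            "//form[@id='_subcat_ids_']//a[contains(@class, 'facet-selection')]");
        if (links == null)
        {
            return facets;
        }

        foreach (HtmlNode a in links)
        {
            string href = a.GetAttributeValue("href", string.Empty);

            // Keep only text nodes that are direct children of <a>,
            // so the "(2)" inside the <span> is not included.
            string text = string.Concat(
                a.ChildNodes
                 .Where(child => child.NodeType == HtmlNodeType.Text)
                 .Select(child => child.InnerText)).Trim();

            // SelectSingleNode also returns null when there is no match.
            HtmlNode span = a.SelectSingleNode(".//span[@class='subtle-note']");
            string raw = span?.InnerText.Trim().Trim('(', ')');
            int? count = int.TryParse(raw, out int parsed) ? parsed : (int?)null;

            facets.Add((href, HtmlEntity.DeEntitize(text), count));
        }

        return facets;
    }
}
```

Calling FacetScraper.Parse(html.DocumentNode.OuterHtml), or passing the downloaded page source directly, returns one tuple per link. Keeping href, text, and count together avoids having to keep two separate lists in step, and a missing <span> yields a null count instead of an exception.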