使用HtmlAgilityPack获取特定行列数据
Using HtmlAgilityPack to get a specific row and column data
这是我的table
<table class="DataRows" frame="myFrames" rules="Standard" width="100%">
<colgroup><col width="70" align="CENTER">
<col width="200" align="LEFT">
<col width="80" align="LEFT">
<col align="LEFT">
<col align="RIGHT">
</colgroup><thead>
<col width="70" align="CENTER">
<col width="200" align="LEFT">
<col width="80" align="LEFT">
<col align="LEFT">
<col align="RIGHT">
<thead>
<tr>
<td valign="TOP"><span class="classicBold"> 20 </span> Kg.
<td class="BOLD" valign="TOP" nowrap="">
PA Passion Foods Inc.
<td class="BOLD">Fax:
<td>
222-555666
<td class="BOLD">
Processed foods and juices
<tr>
<td><a target="_blank" href="">See on Map </a>
<td>
120 NW 157TH AVE
<td class="BOLD">Warehouse Hours:
<td colspan="2">
<tr>
<td>
<td><span class="BOLD">
Jacksonville,
</span>
FL 300000
<td class="BOLD">Url:
<td colspan="2">
<a target="_blank" href="">PA Passion</a>
  
<span class="BOLD">E-mail:</span>
zoro@xyz.com
<tr>
<td>
<td class="REDBOLD" colspan="4">
<tr>
<td>
<td colspan="4" align="LEFT">Franchisee for:<span class="BOLD">
Nutrella
</span>
<tr>
<td>
<td colspan="4" align="LEFT">Franchisee for:<span class="BOLD">
APPLE Foods, Constants
</span>
<tr>
<td>
<td colspan="4" align="LEFT"><span class="BOLD">
</span>
<tr>
<td>
<td colspan="4" align="LEFT">We service:<span class="BOLD">
All occasions and hospitality services
</span>
<tr>
<td>
<td colspan="4" align="LEFT">We sell :<span class="BOLD">
----
</span>
</td></td></tr></td></td></tr></td></td></tr></td></td></tr></td></td></tr></td></td></tr></td></td></td></td></tr></td></td></td></td></tr></td></td></td></td></td></tr>
</thead>
</table>
我正在使用下面的代码Html 遍历我的 Html 文档中的每个节点
foreach (HtmlNode node in htmlAgilityPackDoc.DocumentNode.SelectNodes("//table[contains(@class,'DataRows')]"))
{
}
当我使用下面的
node.SelectSingleNode(".//tr[1]/td[1]").InnerHtml
我得到以下 html
<span class="classicBold"> 20 </span> Kg.
<td class="BOLD" valign="TOP" nowrap="">
PA Passion Foods Inc.
<td class="BOLD">Fax:
<td>
222-555666
<td class="BOLD">
Processed foods and juices
<tr>
<td><a target="_blank" href="">See on Map </a>
<td>
120 NW 157TH AVE
<td class="BOLD">Warehouse Hours:
<td colspan="2">
<tr>
<td>
<td><span class="BOLD">
Jacksonville,
</span>
FL 300000
<td class="BOLD">Url:
<td colspan="2">
<a target="_blank" href="">PA Passion</a>
  
<span class="BOLD">E-mail:</span>
zoro@xyz.com
<tr>
<td>
<td class="REDBOLD" colspan="4">
<tr>
<td>
<td colspan="4" align="LEFT">Franchisee for:<span class="BOLD">
Nutrella
</span>
<tr>
<td>
<td colspan="4" align="LEFT">Franchisee for:<span class="BOLD">
APPLE Foods, Constants
</span>
<tr>
<td>
<td colspan="4" align="LEFT"><span class="BOLD">
</span>
<tr>
<td>
<td colspan="4" align="LEFT">We service:<span class="BOLD">
All occasions and hospitality services
</span>
<tr>
<td>
<td colspan="4" align="LEFT">We sell :<span class="BOLD">
----
</span>
</td></td></tr></td></td></tr></td></td></tr></td></td></tr></td></td></tr></td></td></tr></td></td></td></td></tr></td></td></td></td></tr></td></td></td></td></td>
如何从中提取地址 120 NW 157TH AVE?
当我尝试使用
node.SelectSingleNode(".//td[@class='BOLD'][4]/preceding-sibling::td").InnerText;
我收到一个错误:
Object reference not set to an instance of an object
你的html一团糟标签重叠我建议你使用文本节点作为你的标识符而不是索引
.//td[./a[contains(text(),'See on Map')]]/td/text()
获得
120 NW 157TH AVE
这是一个完整的例子,可以让你了解一切
var table = doc.DocumentNode.SelectSingleNode("//table[contains(@class,'DataRows')]");
var name = table.SelectSingleNode(".//td[@class='BOLD']/text()").InnerText.Trim();
var fax = table.SelectSingleNode(".//td[contains(text(),'Fax')]/td/text()").InnerText.Trim();
var email = table.SelectSingleNode(".//span[contains(text(),'E-mail')]/following-sibling::text()").InnerText.Trim();
var address = table.SelectSingleNode(".//td[./a[contains(text(),'See on Map')]]/td/text()").InnerText.Trim();
var city = table.SelectSingleNode(".//tr[./td/a[contains(text(),'See on Map')]]//tr/td/td/span").InnerText.Trim(',');
var zip = table.SelectSingleNode(".//tr[./td/a[contains(text(),'See on Map')]]//tr/td/td/span/following-sibling::text()").InnerText.Trim();
请注意,由于您的 html 有多混乱,xpath 也必须如此混乱,因此尝试通过索引访问 tr
元素将不起作用,因为所有 tr 元素都是前一个元素的子元素tr
,正常 table 中的 .//tr[4]
是 table 中的 .//tr/tr/tr/tr
。
这是我的table
<table class="DataRows" frame="myFrames" rules="Standard" width="100%">
<colgroup><col width="70" align="CENTER">
<col width="200" align="LEFT">
<col width="80" align="LEFT">
<col align="LEFT">
<col align="RIGHT">
</colgroup><thead>
<col width="70" align="CENTER">
<col width="200" align="LEFT">
<col width="80" align="LEFT">
<col align="LEFT">
<col align="RIGHT">
<thead>
<tr>
<td valign="TOP"><span class="classicBold"> 20 </span> Kg.
<td class="BOLD" valign="TOP" nowrap="">
PA Passion Foods Inc.
<td class="BOLD">Fax:
<td>
222-555666
<td class="BOLD">
Processed foods and juices
<tr>
<td><a target="_blank" href="">See on Map </a>
<td>
120 NW 157TH AVE
<td class="BOLD">Warehouse Hours:
<td colspan="2">
<tr>
<td>
<td><span class="BOLD">
Jacksonville,
</span>
FL 300000
<td class="BOLD">Url:
<td colspan="2">
<a target="_blank" href="">PA Passion</a>
  
<span class="BOLD">E-mail:</span>
zoro@xyz.com
<tr>
<td>
<td class="REDBOLD" colspan="4">
<tr>
<td>
<td colspan="4" align="LEFT">Franchisee for:<span class="BOLD">
Nutrella
</span>
<tr>
<td>
<td colspan="4" align="LEFT">Franchisee for:<span class="BOLD">
APPLE Foods, Constants
</span>
<tr>
<td>
<td colspan="4" align="LEFT"><span class="BOLD">
</span>
<tr>
<td>
<td colspan="4" align="LEFT">We service:<span class="BOLD">
All occasions and hospitality services
</span>
<tr>
<td>
<td colspan="4" align="LEFT">We sell :<span class="BOLD">
----
</span>
</td></td></tr></td></td></tr></td></td></tr></td></td></tr></td></td></tr></td></td></tr></td></td></td></td></tr></td></td></td></td></tr></td></td></td></td></td></tr>
</thead>
</table>
我正在使用下面的代码Html 遍历我的 Html 文档中的每个节点
foreach (HtmlNode node in htmlAgilityPackDoc.DocumentNode.SelectNodes("//table[contains(@class,'DataRows')]"))
{
}
当我使用下面的
node.SelectSingleNode(".//tr[1]/td[1]").InnerHtml
我得到以下 html
<span class="classicBold"> 20 </span> Kg.
<td class="BOLD" valign="TOP" nowrap="">
PA Passion Foods Inc.
<td class="BOLD">Fax:
<td>
222-555666
<td class="BOLD">
Processed foods and juices
<tr>
<td><a target="_blank" href="">See on Map </a>
<td>
120 NW 157TH AVE
<td class="BOLD">Warehouse Hours:
<td colspan="2">
<tr>
<td>
<td><span class="BOLD">
Jacksonville,
</span>
FL 300000
<td class="BOLD">Url:
<td colspan="2">
<a target="_blank" href="">PA Passion</a>
  
<span class="BOLD">E-mail:</span>
zoro@xyz.com
<tr>
<td>
<td class="REDBOLD" colspan="4">
<tr>
<td>
<td colspan="4" align="LEFT">Franchisee for:<span class="BOLD">
Nutrella
</span>
<tr>
<td>
<td colspan="4" align="LEFT">Franchisee for:<span class="BOLD">
APPLE Foods, Constants
</span>
<tr>
<td>
<td colspan="4" align="LEFT"><span class="BOLD">
</span>
<tr>
<td>
<td colspan="4" align="LEFT">We service:<span class="BOLD">
All occasions and hospitality services
</span>
<tr>
<td>
<td colspan="4" align="LEFT">We sell :<span class="BOLD">
----
</span>
</td></td></tr></td></td></tr></td></td></tr></td></td></tr></td></td></tr></td></td></tr></td></td></td></td></tr></td></td></td></td></tr></td></td></td></td></td>
如何从中提取地址 120 NW 157TH AVE?
当我尝试使用
node.SelectSingleNode(".//td[@class='BOLD'][4]/preceding-sibling::td").InnerText;
我收到一个错误:
Object reference not set to an instance of an object
你的html一团糟标签重叠我建议你使用文本节点作为你的标识符而不是索引
.//td[./a[contains(text(),'See on Map')]]/td/text()
获得
120 NW 157TH AVE
这是一个完整的例子,可以让你了解一切
var table = doc.DocumentNode.SelectSingleNode("//table[contains(@class,'DataRows')]");
var name = table.SelectSingleNode(".//td[@class='BOLD']/text()").InnerText.Trim();
var fax = table.SelectSingleNode(".//td[contains(text(),'Fax')]/td/text()").InnerText.Trim();
var email = table.SelectSingleNode(".//span[contains(text(),'E-mail')]/following-sibling::text()").InnerText.Trim();
var address = table.SelectSingleNode(".//td[./a[contains(text(),'See on Map')]]/td/text()").InnerText.Trim();
var city = table.SelectSingleNode(".//tr[./td/a[contains(text(),'See on Map')]]//tr/td/td/span").InnerText.Trim(',');
var zip = table.SelectSingleNode(".//tr[./td/a[contains(text(),'See on Map')]]//tr/td/td/span/following-sibling::text()").InnerText.Trim();
请注意,由于您的 html 有多混乱,xpath 也必须如此混乱,因此尝试通过索引访问 tr
元素将不起作用,因为所有 tr 元素都是前一个元素的子元素tr
,正常 table 中的 .//tr[4]
是 table 中的 .//tr/tr/tr/tr
。