HtmlAgilityPack and XPath Target
I have the following HTML:
<table>
<tr>
<td><a href="#">Tournament Name</a>
<br /> Tournament Address </td>
</tr>
<tr>
<td><a>View Available Space and Book Online</a></td>
</tr>
<tr>
<td>
<em>Event Cost:</em> $$$
</td>
<td> Date and Time </td>
</tr>
<tr>
<td>
<p>
<strong>
<img title="Boy's Teams can enter this tournament" />
<img title="Girl's Teams can not enter this tournament" />
<img title="Disabled Teams can not enter this tournament" />
</strong>
</p>
</td>
<td>
TimeFrame
</td>
</tr>
<tr>
<td>
<img src="image.gif" />
<img src="image.gif" />
<img src="image.gif" />
<img src="image.gif" />
<img src="image.gif" />
<img src="image.gif" />
<img src="image.gif" />
<img src="image...." />
<img src="image...." />
<img src="image...." />
<img src="image...." />
</td>
</tr>
</table>
(This table is repeated many times on the page.)
I am trying to extract the tournament name.
I have the following C# code:
namespace AcademyScraper
{
public partial class Main : Form
{
public Main()
{
InitializeComponent();
}
private void saveBtn_Click(object sender, EventArgs e)
{
string url = "http://www.reddishvulcans.com/uk_tournament_database.asp";
var Webget = new HtmlWeb();
var doc = Webget.Load(url);
var root = doc.DocumentNode;
var nodes = root.Descendants();
HtmlNodeCollection tableCollection = doc.DocumentNode.SelectNodes("//div[@class='infobox']/table");
for (Int32 i = 0; i < tableCollection.Count(); i++)
{
HtmlNode tournamentName = tableCollection[i].SelectSingleNode("/tr[1]/td/a");
MessageBox.Show(tournamentName.InnerText);
// I get an exception here
}
}
}
}
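The problem is that, no matter what I try, I cannot seem to target the tag that contains the tournament name. If I call MessageBox.Show(tableCollection[i].OuterHtml); the table contents are displayed in the message box without any problem. However, whenever I try to get tournamentName, I get a reference exception. Based on the HTML, I think the XPath should be correct.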
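The call var doc = Webget.Load(url); is a network operation that can take some time, yet you run it on the main thread, which causes a conflict. You need to run network tasks on another thread. Note that MessageBox.Show(tournamentName.InnerText); touches the UI thread (the main thread), so you should run it inside an Invoke delegate.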
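As a rough illustration of moving the download off the UI thread, here is a minimal Windows Forms sketch. It assumes the HtmlAgilityPack NuGet package is referenced. The MainForm class and the message text are made up for the example, and the XPath is only a placeholder query. Because the handler is async, execution resumes on the UI thread after the await, so no explicit Invoke is needed in this particular shape. This only addresses UI responsiveness.

```csharp
using System;
using System.Threading.Tasks;
using System.Windows.Forms;
using HtmlAgilityPack;

// Hypothetical form used only for this sketch; not the asker's actual class.
public class MainForm : Form
{
    private readonly Button saveBtn = new Button { Text = "Load", Dock = DockStyle.Top };

    public MainForm()
    {
        Controls.Add(saveBtn);
        saveBtn.Click += SaveBtn_Click;
    }

    private async void SaveBtn_Click(object sender, EventArgs e)
    {
        string url = "http://www.reddishvulcans.com/uk_tournament_database.asp";
        saveBtn.Enabled = false;
        try
        {
            // Run the blocking download on a thread-pool thread.
            HtmlDocument doc = await Task.Run(() => new HtmlWeb().Load(url));

            // Back on the UI thread here, so touching controls is safe.
            HtmlNodeCollection links =
                doc.DocumentNode.SelectNodes("//div[@class='infobox']/table/tr/td[br]/a");

            // SelectNodes returns null when nothing matches.
            MessageBox.Show(links == null
                ? "No matching links found."
                : links.Count + " matching links found.");
        }
        finally
        {
            saveBtn.Enabled = true;
        }
    }

    [STAThread]
    public static void Main()
    {
        Application.Run(new MainForm());
    }
}
```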
Maybe you can try something like this (I created a console application to try it):
private void saveBtn_Click(object sender, EventArgs e)
{
string url = "http://www.reddishvulcans.com/uk_tournament_database.asp";
var Webget = new HtmlWeb();
var doc = Webget.Load(url);
var aTags = doc.DocumentNode.SelectNodes("//div[@class='infobox']/table/tr/td[1]/a");
foreach (var tag in aTags)
{
Console.WriteLine(tag.InnerText);
}
Console.ReadLine();
}
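If you would rather keep the per-table loop from the question, one thing worth checking is the leading slash in "/tr[1]/td/a". In HtmlAgilityPack an XPath that starts with "/" is evaluated from the document root even when it is called on a child node, so SelectSingleNode returns null and reading InnerText then throws. The sketch below uses "./tr[1]/td/a" instead, which is relative to the current table. The sample markup, the class name and the method name are made up for the demo.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class RelativeXPathDemo
{
    // Hypothetical sample markup standing in for the real page.
    public const string SampleHtml =
        "<div class='infobox'><table>" +
        "<tr><td><a href='#'>Tournament A</a><br /> Address A </td></tr>" +
        "<tr><td><a>View Available Space and Book Online</a></td></tr>" +
        "</table></div>" +
        "<div class='infobox'><table>" +
        "<tr><td><a href='#'>Tournament B</a><br /> Address B </td></tr>" +
        "</table></div>";

    public static List<string> ExtractNames(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var names = new List<string>();
        HtmlNodeCollection tables =
            doc.DocumentNode.SelectNodes("//div[@class='infobox']/table");
        if (tables == null) return names; // SelectNodes returns null when nothing matches

        foreach (HtmlNode table in tables)
        {
            // "./" anchors the query at the current table node.
            HtmlNode link = table.SelectSingleNode("./tr[1]/td/a");
            if (link != null) names.Add(link.InnerText.Trim());
        }
        return names;
    }

    public static void Main()
    {
        foreach (string name in ExtractNames(SampleHtml))
        {
            Console.WriteLine(name);
        }
    }
}
```

The null checks matter because both SelectNodes and SelectSingleNode return null rather than an empty result when nothing matches.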
The following XPath seems to work fine for me:
//div[@class='infobox']/table/tr/td[br]/a
Console application demo:
string url = "http://www.reddishvulcans.com/uk_tournament_database.asp";
var Webget = new HtmlWeb();
var doc = Webget.Load(url);
// print only the first 10 results, just for the sake of the demo
var result = doc.DocumentNode
.SelectNodes("//div[@class='infobox']/table/tr/td[br]/a")
.Take(10);
foreach (HtmlNode node in result)
{
Console.WriteLine(node.InnerText);
}
Output:
The North West Junior Champions League 2016
PLAY AT CHELSEA - STAMFORD BRIDGE FOOTBALL TOURNAMENT 2016
PLAY AT FC BARCELONA - CAMP NOU FOOTBALL TOUR 2016 - THE EUROPA CUP
Silverdale Soccersevens XIX
NORTH HALIFAX MINI SOCCER TOURNAMENT 2016
Halton & District JFL Mini Soccer Tournament
Colwyn Bay FC Junior Tournament
GMCJFC Pat Mangan Festival of Football 2016
Fred England Trophy
Fred England Trophy
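To show what the td[br] predicate is doing, here is a small offline sketch that runs the same XPath against a trimmed copy of the table from the question. The predicate keeps only cells that have a br child, so the "View Available Space and Book Online" link, whose cell has no br, is skipped, whereas td[1]/a would match it as well. The class name, the method name and the inline markup are made up for the demo. The null guard is there because SelectNodes returns null when nothing matches, in which case chaining .Take(10) directly would throw.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

public static class PredicateDemo
{
    // Trimmed, hypothetical copy of the table structure from the question.
    public const string SampleHtml =
        "<div class='infobox'><table>" +
        "<tr><td><a href='#'>Tournament Name</a><br /> Tournament Address </td></tr>" +
        "<tr><td><a>View Available Space and Book Online</a></td></tr>" +
        "<tr><td><em>Event Cost:</em> $$$</td><td> Date and Time </td></tr>" +
        "</table></div>";

    public static string[] NamesFromCellsWithBr(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // td[br] selects only <td> elements that have a <br> child element.
        HtmlNodeCollection links =
            doc.DocumentNode.SelectNodes("//div[@class='infobox']/table/tr/td[br]/a");

        // Guard against the null that SelectNodes returns for no matches.
        if (links == null) return new string[0];

        return links.Select(n => n.InnerText.Trim()).ToArray();
    }

    public static void Main()
    {
        foreach (string name in NamesFromCellsWithBr(SampleHtml))
        {
            Console.WriteLine(name);
        }
    }
}
```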