Scraping data from Wikipedia using HtmlAgilityPack

Question

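I'm trying to scrape data from a table on the Wikipedia site, and so far I've managed to find the nodes I need to reference. The Wikipedia table has a large number of entries; however, when I run the app I only get twelve results, and they are all identical. Every result returned is a copy of the first entry in the table.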

Any ideas on how to fix this?

protected async override void OnNavigatedTo(NavigationEventArgs e)
{
    base.OnNavigatedTo(e);
    string htmlPage = "";
    {
        htmlPage = await client.GetStringAsync("http://en.wikipedia.org/wiki/List_of_Games_with_Gold_games");
    }

    HtmlDocument htmlDocument = new HtmlDocument();
    htmlDocument.LoadHtml(htmlPage);

    foreach (var div in htmlDocument.DocumentNode.SelectNodes(".//h2"))
    {
        GameHistory newGameHistory = new GameHistory();
        newGameHistory.historyTitle = div.SelectSingleNode("//i//a").InnerText.Trim();
        newGameHistory.historyAdded = div.SelectSingleNode("//span[starts-with(@style, 'white')]").InnerText.Trim();
        newGameHistory.historyRemoved = div.SelectSingleNode("(//span[starts-with(@style, 'white')])[2]").InnerText.Trim();
        gameHistory.Add(newGameHistory);
    }
    lstGameHistory.ItemsSource = gameHistory;
}

Answer 1

Your XPath isn't quite right...

foreach (var div in htmlDocument.DocumentNode.SelectNodes(".//h2"))
{
    GameHistory newGameHistory = new GameHistory();
    newGameHistory.historyTitle = div.SelectSingleNode("//i//a").InnerText.Trim();
    newGameHistory.historyAdded = div.SelectSingleNode("//span[starts-with(@style, 'white')]").InnerText.Trim();
    newGameHistory.historyRemoved = div.SelectSingleNode("(//span[starts-with(@style, 'white')])[2]").InnerText.Trim();
    gameHistory.Add(newGameHistory);
}

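This says: "I have an h2 tag. Let me get all of the i tags with a tags inside of them, and certain span tags... nothing to do with the h2 tag though. And let's just keep on getting the firsts in the entire document." That is what the leading double slash means: the search starts at the document root, not at the current node.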

You get 12 results because that is the number of h2 tags on the page.
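To make the difference between a leading "//" and a leading ".//" concrete, here is a minimal sketch. It assumes a made-up two-row table (the sample markup and the class and method names are hypothetical), and it uses the built-in System.Xml.XPath extensions on well-formed markup instead of HtmlAgilityPack so that it runs without any NuGet package. The XPath strings themselves are the same ones you would pass to SelectSingleNode.

```csharp
using System.Linq;
using System.Xml.Linq;
using System.Xml.XPath;

public static class XPathContextDemo
{
    // Hypothetical, simplified stand-in for the page: two rows, one title each.
    private const string Sample =
        "<body><table>" +
        "<tr><td><i><a>Game A</a></i></td></tr>" +
        "<tr><td><i><a>Game B</a></i></td></tr>" +
        "</table></body>";

    // "//i//a" is an absolute path: for every row the search restarts at the
    // document root, so the first match in the whole document comes back each time.
    public static string[] TitlesWithAbsolutePath()
    {
        XDocument doc = XDocument.Parse(Sample);
        return doc.XPathSelectElements("//tr")
                  .Select(row => row.XPathSelectElement("//i//a").Value)
                  .ToArray();
    }

    // ".//i//a" is relative: the search starts at the current row, so each row
    // yields its own title.
    public static string[] TitlesWithRelativePath()
    {
        XDocument doc = XDocument.Parse(Sample);
        return doc.XPathSelectElements("//tr")
                  .Select(row => row.XPathSelectElement(".//i//a").Value)
                  .ToArray();
    }
}
```

With this sample, TitlesWithAbsolutePath() gives "Game A" twice, while TitlesWithRelativePath() gives "Game A" and "Game B". That is the same effect as the twelve identical results in the question.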

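In any case, even if you picked the h2 tag as your reference point on purpose, it doesn't seem to have much to do with the table rows. Take a look at the page!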

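So what you need is an XPath that selects every row of the right table (in this case, tables). Then, for each row, your XPath should start with "." (self), so that you don't jump back to the root of the document again.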

Also, some games have no "Removed" entry, so you should handle that case as well.

Here's my code:

    // Select every data row: a tr that has a td containing an <i><a> title link.
    foreach (var row in htmlDocument.DocumentNode.SelectNodes("//table[@class='wikitable sortable']/tr[td/i/a]"))
    {
        GameHistory newGameHistory = new GameHistory();
        // The leading "." keeps each query relative to the current row.
        newGameHistory.historyTitle = row.SelectSingleNode(".//i//a").InnerText.Trim();
        newGameHistory.historyAdded = row.SelectSingleNode(".//span[starts-with(@style, 'white')]").InnerText.Trim();
        // Some games have no "Removed" date, so the second span may be missing.
        var removed = row.SelectSingleNode("(.//span[starts-with(@style, 'white')])[2]");
        newGameHistory.historyRemoved = removed != null ? removed.InnerText.Trim() : string.Empty;
        gameHistory.Add(newGameHistory);
    }

Hint: to get the heading, from inside the foreach loop (where you are on a tr), go up one level (..) to the table tag. Then, to get the h2, which is the tag before the table, use preceding-sibling.

So the XPath would be "../preceding-sibling::h2". The h2 seems to capture some other characters as well, so you'll have to refine your XPath further.
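As a rough sketch of that hint, the snippet below walks from a row up to its table and back to the nearest preceding h2. The sample markup is made up, and the class names mw-headline and mw-editsection are an assumption about how the heading is structured (the extra characters are presumably the section's edit link), so check them against the real page. The [1] predicate is my addition: on the preceding-sibling axis it picks the closest h2 rather than relying on the order in which results come back. As in the earlier sketch, System.Xml.XPath stands in for HtmlAgilityPack so the code runs on its own.

```csharp
using System.Linq;
using System.Xml.Linq;
using System.Xml.XPath;

public static class HeadingLookupDemo
{
    // Hypothetical markup: each table is preceded by an h2 whose text includes
    // an "[edit]" link next to the actual heading (class names are assumed).
    private const string Sample =
        "<body>" +
        "<h2><span class=\"mw-headline\">2013</span><span class=\"mw-editsection\">[edit]</span></h2>" +
        "<table><tr><td><i><a>Game A</a></i></td></tr></table>" +
        "<h2><span class=\"mw-headline\">2014</span><span class=\"mw-editsection\">[edit]</span></h2>" +
        "<table><tr><td><i><a>Game B</a></i></td></tr></table>" +
        "</body>";

    // Finds the data row whose title link matches the given text.
    private static XElement FindRow(string title)
    {
        XDocument doc = XDocument.Parse(Sample);
        return doc.XPathSelectElements("//table/tr[td/i/a]")
                  .First(r => r.XPathSelectElement(".//i//a").Value == title);
    }

    // Unrefined version: the whole h2 text, including the extra characters.
    public static string RawHeading(string title)
    {
        XElement h2 = FindRow(title).XPathSelectElement("../preceding-sibling::h2[1]");
        return h2 != null ? h2.Value : string.Empty;
    }

    // Refined version: only the span assumed to hold the heading text.
    public static string Heading(string title)
    {
        XElement span = FindRow(title).XPathSelectElement(
            "../preceding-sibling::h2[1]/span[@class='mw-headline']");
        return span != null ? span.Value.Trim() : string.Empty;
    }
}
```

For the sample above, RawHeading("Game B") yields "2014[edit]" and Heading("Game B") yields "2014". In the real loop you would pass the same XPath strings to SelectSingleNode on the current row.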

c#

web-scraping

html-agility-pack

windows-phone-8