Get Zip Codes in Wikipedia with HTML Document

Question

I am reading this Wikipedia page, the list of postal codes in Spain: http://en.wikipedia.org/wiki/List_of_postal_codes_in_Spain

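My goal is to get all the postal codes from the "Full codes" section of the page. For example, I need to get this information (postal code - area):

    03000 to 03099 - Alicante
    03189 - Villamartin
    03201 to 03299 - Elche
    03400 - Villena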

In my code, I am struggling to get only the li and a tags that come after the "Full codes" heading.

    HttpWebRequest request = (HttpWebRequest)HttpWebRequest.Create("http://en.wikipedia.org/wiki/List_of_postal_codes_in_Spain");
    request.UserAgent = "Test wiki";
    WebResponse response = request.GetResponse();
    Stream stream = response.GetResponseStream();
    StreamReader reader = new StreamReader(stream);
    string htmlText = reader.ReadToEnd();

    var doc = new HtmlAgilityPack.HtmlDocument();
    doc.LoadHtml(htmlText);

    if (doc.DocumentNode != null)
    {
        HtmlNodeCollection divs = doc.DocumentNode.SelectNodes("//li");
        foreach (HtmlNode listElement in divs)
        {
            if (listElement.SelectNodes("//a[@href]").Count > 0)
            { // I do not get what I wish
                foreach (HtmlNode listElement2 in listElement.SelectNodes("//a[@href]"))
                {
                    string s = listElement2.Name;
                    string ss = listElement2.InnerText;
                }
            }
        }
    }

Answer 1

Personally, I would avoid using regex for parsing HTML. To get you started, an XPath expression that gets the <li> tags after the "Full codes" heading looks roughly like this:

    //h2[span='Full codes']/following::li

To be more precise, you can select the sibling <ul> elements instead and then take their <li> children:

    //h2[span='Full codes']/following-sibling::ul/li
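To see how the two expressions differ, here is a minimal sketch run against a tiny hand-written page rather than the live article. The sample markup, the `AxisDemo` class and its `Count` helper are made up for illustration; only `HtmlDocument.LoadHtml` and `SelectNodes` come from HtmlAgilityPack. It assumes the heading is an <h2> that wraps a <span>, as the XPath above does.

```csharp
using System;
using HtmlAgilityPack;

static class AxisDemo
{
    // Hypothetical, heavily simplified stand-in for the real article markup.
    public const string SampleHtml = @"
<div id='content'>
  <h2><span>Full codes</span></h2>
  <ul>
    <li>03000 to 03099 - <a href='/wiki/Alicante'>Alicante</a></li>
    <li>03189 - <a href='/wiki/Villamartin'>Villamartin</a></li>
  </ul>
  <h2><span>See also</span></h2>
  <ul>
    <li><a href='/wiki/Spain'>Spain</a></li>
  </ul>
</div>
<div id='footer'>
  <ul>
    <li>Privacy policy</li>
  </ul>
</div>";

    // Counts the nodes an XPath expression selects in the given HTML.
    public static int Count(string html, string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        // SelectNodes returns null, not an empty collection, when nothing matches.
        var nodes = doc.DocumentNode.SelectNodes(xpath);
        return nodes == null ? 0 : nodes.Count;
    }

    static void Main()
    {
        // following:: walks everything after the heading in document order,
        // so the footer item is included.
        Console.WriteLine(Count(SampleHtml, "//h2[span='Full codes']/following::li"));            // 4

        // following-sibling:: stays among the heading's own siblings,
        // so the footer item is skipped.
        Console.WriteLine(Count(SampleHtml, "//h2[span='Full codes']/following-sibling::ul/li")); // 3
    }
}
```

Note that in this sample the sibling form still picks up the list under "See also", because that <ul> is also a following sibling of the heading. Whether that matters on the live page depends on its actual structure, which this sketch does not check.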

As a side note, HtmlAgilityPack's HtmlWeb can also load the Wikipedia page in a shorter way:

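    var doc = new HtmlWeb().Load("http://en.wikipedia.org/wiki/List_of_postal_codes_in_Spain");
    if (doc.DocumentNode != null)
    {
        // SelectNodes returns null rather than an empty collection when nothing matches.
        var data = doc.DocumentNode.SelectNodes("//h2[span='Full codes']/following-sibling::ul/li");
        if (data != null)
        {
            foreach (HtmlNode htmlNode in data)
            {
                // Each <li> holds one "postal code - area" entry.
                Console.WriteLine(htmlNode.InnerText.Trim());
            }
        }
    }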
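If the goal is the "postal code - area" pairs from the question, each item's text can be split after it has been selected. The sketch below is only one way to do that: it assumes the code range and the place name are separated by " - " as in the question's sample, which may not match the live page, and the `PostalCodes` class, its `Parse` method and the sample markup are hypothetical. It also uses `.//a[@href]` with a leading dot. An expression that starts with `//`, like the `//a[@href]` in the question, is evaluated from the document root even when it is called on a child node, which is why the question's inner loop returned every link on the page.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

static class PostalCodes
{
    // Hypothetical sample; the last item has no " - " separator and is skipped.
    public const string SampleHtml = @"
<div>
  <h2><span>Full codes</span></h2>
  <ul>
    <li>03000 to 03099 - <a href='/wiki/Alicante'>Alicante</a></li>
    <li>03189 - <a href='/wiki/Villamartin'>Villamartin</a></li>
    <li><a href='/wiki/Spain'>Spain</a></li>
  </ul>
</div>";

    // Returns (codes, place) pairs for the <li> items after the "Full codes" heading.
    public static List<KeyValuePair<string, string>> Parse(string html)
    {
        var result = new List<KeyValuePair<string, string>>();
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var items = doc.DocumentNode.SelectNodes("//h2[span='Full codes']/following-sibling::ul/li");
        if (items == null) return result; // nothing matched

        foreach (HtmlNode li in items)
        {
            string text = HtmlEntity.DeEntitize(li.InnerText).Trim();

            // Assumption: codes and place are separated by " - "; split at the first one only.
            string[] parts = text.Split(new[] { " - " }, 2, StringSplitOptions.None);
            if (parts.Length != 2) continue;

            // The leading "." keeps the search inside this <li>.
            HtmlNode link = li.SelectSingleNode(".//a[@href]");
            string place = link != null
                ? HtmlEntity.DeEntitize(link.InnerText).Trim()
                : parts[1].Trim();

            result.Add(new KeyValuePair<string, string>(parts[0].Trim(), place));
        }
        return result;
    }

    static void Main()
    {
        foreach (var entry in Parse(SampleHtml))
        {
            Console.WriteLine(entry.Key + " -> " + entry.Value);
        }
        // Prints:
        // 03000 to 03099 -> Alicante
        // 03189 -> Villamartin
    }
}
```

Splitting the already-extracted text is plain string handling, so it stays in line with the advice above to leave the HTML parsing itself to XPath.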
