Extract unidentified HTML content from between two tags, using jsoup? Regex?

Question

I want to get the names of all the links between these two h2 tags:

<h2><span class="mw-headline" id="People">People</span><span class="mw-editsection"><span class="mw-editsection-bracket">[</span><a href="/w/index.php?title=Bush&amp;action=edit&amp;section=1" title="Edit section: People">edit</a><span class="mw-editsection-bracket">]</span></span></h2>
<ul>
<li><a href="/wiki/George_H._W._Bush" title="George H. W. Bush">George H. W. Bush</a> (born 1924), the 41st president of the United States of America</li>
<li><a href="/wiki/George_W._Bush" title="George W. Bush">George W. Bush</a> (born 1946), the 43rd president of the United States of America</li>
<li><a href="/wiki/Jeb_Bush" title="Jeb Bush">Jeb Bush</a> (born 1953), the former governor of Florida and also a member of the Bush family</li>
<li><a href="/wiki/Bush_family" title="Bush family">Bush family</a>, the political family that includes both presidents</li>
<li><a href="/wiki/Bush_(surname)" title="Bush (surname)">Bush (surname)</a>, a surname (including a list of people with the name)    </li>
</ul>
<h2><span class="mw-headline" id="Places.2C_United_States">Places, United States</span><span class="mw-editsection"><span class="mw-editsection-bracket">[</span><a href="/w/index.php?title=Bush&amp;action=edit&amp;section=2" title="Edit section: Places, United States">edit</a><span class="mw-editsection-bracket">]</span></span></h2>

Neither this

    Elements h2next = docx.select("span.mw-headline#People");
    do 
    {
     ul = h2next.select("ul").first();
     System.out.println(ul.text());
    } 
    while (h2next!=null && ul==null);

nor this

    //String content = docx.getElementById("People").outerHtml();

works.

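It looks like this guy had the right idea, but I can't adapt it to my case.

Maybe I should just use regex?

Wikipedia's HTML is somewhat "unstructured" and hard to work with.

From the Wikipedia disambiguation page, I want to grab the different senses in which Bush (or whatever ambiguous name I'm considering) can refer to a person.

I've tried various ways of getting this data with jsoup, but I haven't been able to figure it out.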

I tried this:

    Document docx = Jsoup.connect("https://en.wikipedia.org/wiki/Bush").get();
    Element contentDiv = docx.select("span#mw-headlinePeople").first();
    String printMe = contentDiv.toString(); // The result

because I noticed that the data I want sits in the section headed by:

    <h2><span class="mw-headline" id="People">

But that gave no output.

Based on earlier questions, I tried a few variations, such as this:

    .select("span#mw-headlinePeople");

But I still got nothing.

How can I get that information?

Ideally, what I want is something like this:

George H. W. Bush 
George W. Bush 
Jeb Bush

I realize I may initially also have to pick up Bush family and Bush (surname), since they are part of that section, but I figure I can remove them later.

Also, is it faster to use this:

    Document docx = Jsoup.connect("https://en.wikipedia.org/wiki/Bush").get();

or this:

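    // Requires: java.io.BufferedReader, java.io.InputStreamReader,
    //           java.net.URL, java.net.URLConnection
    URL site_two = new URL("https://en.wikipedia.org/wiki/Bush");

    URLConnection ycb = site_two.openConnection();
    BufferedReader inb = new BufferedReader(
                            new InputStreamReader(
                            ycb.getInputStream()));

    StringBuilder sb = new StringBuilder();
    String inputLine;

    // Read exactly one line per iteration; a second readLine() call
    // inside the loop body would skip every other line.
    while ((inputLine = inb.readLine()) != null) 
    {
        //get the disambig
        //System.out.println(inputLine);

        sb.append(inputLine);
        sb.append(System.lineSeparator());
    }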

I tried using this site, but it turned out not to be very useful. Someone should make a jsoup site like all those regex tester sites.

Answer 1

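One possible approach is to select all the headlines (span.mw-headline) and all the links (the best selector I found for those is li > a).

If you select both with a single selector, combining them with a comma, the results come back in the order they appear on the page. So you can keep track of whether you are inside the "People section" while looping over the results: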

    Elements elements = docx.select("span.mw-headline, li > a");

    boolean inPeopleSection = false;
    for (Element elem : elements) {
        if (elem.className().equals("mw-headline")) {
            // It's a headline
            inPeopleSection = elem.id().equals("People");
        } else {
            // It's a link
            if (inPeopleSection) {
                System.out.println(elem.text());
            }
        }
    }

Output:

George H. W. Bush
George W. Bush
Jeb Bush
Bush family
Bush (surname)
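
Another way to read "between two h2 tags" literally is to start at the heading and walk its following siblings until the next h2. This also hints at why the question's first attempt finds nothing: calling select("ul") on the headline span only searches inside that span, whereas the list is a sibling of the enclosing h2. The following is only a sketch under some assumptions: it runs against a trimmed inline copy of the fragment from the question rather than the live page (whose markup may differ), the class name SectionLinks and the helper linksInSection are made up for illustration, and the "Example place" entry is a hypothetical placeholder.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class SectionLinks {

    // Trimmed copy of the fragment from the question (edit links omitted).
    // The "Example place" entry is a hypothetical placeholder.
    static final String SAMPLE_HTML =
        "<h2><span class=\"mw-headline\" id=\"People\">People</span></h2>"
        + "<ul>"
        + "<li><a href=\"/wiki/George_H._W._Bush\">George H. W. Bush</a> (born 1924)</li>"
        + "<li><a href=\"/wiki/George_W._Bush\">George W. Bush</a> (born 1946)</li>"
        + "<li><a href=\"/wiki/Jeb_Bush\">Jeb Bush</a> (born 1953)</li>"
        + "</ul>"
        + "<h2><span class=\"mw-headline\" id=\"Places.2C_United_States\">"
        + "Places, United States</span></h2>"
        + "<ul><li><a href=\"/wiki/Example_place\">Example place</a></li></ul>";

    // Hypothetical helper: collects the text of every link that sits between
    // the h2 enclosing the element with the given id and the next h2.
    static List<String> linksInSection(Document doc, String headlineId) {
        List<String> names = new ArrayList<>();
        Element headline = doc.getElementById(headlineId);
        if (headline == null || headline.parent() == null) {
            return names; // no such section
        }
        Element heading = headline.parent(); // the enclosing <h2>
        for (Element sib = heading.nextElementSibling();
             sib != null && !sib.tagName().equals("h2");
             sib = sib.nextElementSibling()) {
            // select("a") searches the sibling and its descendants
            for (Element link : sib.select("a")) {
                names.add(link.text());
            }
        }
        return names;
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(SAMPLE_HTML);
        for (String name : linksInSection(doc, "People")) {
            System.out.println(name);
        }
    }
}
```

Compared with the flag-based loop above, this variant does not depend on every section using the same headline class, but it does assume that the heading and its list are siblings, as they are in the fragment shown in the question.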

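As for performance, I don't think it makes any difference at all; just use the simpler version (though my Jsoup experience is very limited, so don't take my word for it).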

Answer 2

A simple selector would be h2:contains(people) + ul a, for example:

    Elements els = doc.select("h2:contains(people) + ul a");

which gives these elements:

0 <a href="/wiki/George_H._W._Bush" title="George H. W. Bush">
George H. W. Bush
1 <a href="/wiki/George_W._Bush" title="George W. Bush">
George W. Bush
2 <a href="/wiki/Jeb_Bush" title="Jeb Bush">
Jeb Bush
3 <a href="/wiki/Bush_family" title="Bush family">
Bush family
4 <a href="/wiki/Bush_(surname)" title="Bush (surname)">
Bush (surname)

I used try.jsoup.org (see the working example) and the selector syntax guide as resources.
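
For anyone who wants to try the selector without hitting the network, here is a rough, self-contained sketch. It parses a trimmed inline copy of the fragment from the question, applies h2:contains(people) + ul a, and then drops the entries the asker said they would remove later. The class name PeopleSelectorDemo, the helper peopleNames, and the exclusion set are assumptions made for illustration, not part of jsoup; the sketch also relies on jsoup's :contains matching text case-insensitively.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class PeopleSelectorDemo {

    // Trimmed copy of the fragment from the question (edit links omitted).
    static final String SAMPLE_HTML =
        "<h2><span class=\"mw-headline\" id=\"People\">People</span></h2>"
        + "<ul>"
        + "<li><a href=\"/wiki/George_H._W._Bush\">George H. W. Bush</a> (born 1924)</li>"
        + "<li><a href=\"/wiki/George_W._Bush\">George W. Bush</a> (born 1946)</li>"
        + "<li><a href=\"/wiki/Jeb_Bush\">Jeb Bush</a> (born 1953)</li>"
        + "<li><a href=\"/wiki/Bush_family\">Bush family</a>, the political family</li>"
        + "<li><a href=\"/wiki/Bush_(surname)\">Bush (surname)</a>, a surname</li>"
        + "</ul>"
        + "<h2><span class=\"mw-headline\" id=\"Places.2C_United_States\">"
        + "Places, United States</span></h2>";

    // Hypothetical helper: link texts from the list that directly follows the
    // "People" heading, minus any names listed in `excluded`.
    static List<String> peopleNames(String html, Set<String> excluded) {
        Document doc = Jsoup.parse(html);
        List<String> names = new ArrayList<>();
        for (Element link : doc.select("h2:contains(people) + ul a")) {
            String name = link.text();
            if (!excluded.contains(name)) {
                names.add(name);
            }
        }
        return names;
    }

    public static void main(String[] args) {
        // Entries that belong to the section but are not individual people.
        Set<String> skip = new HashSet<>(Arrays.asList("Bush family", "Bush (surname)"));
        for (String name : peopleNames(SAMPLE_HTML, skip)) {
            System.out.println(name);
        }
    }
}
```

With the sample fragment this prints the three names from the question's ideal output. Filtering by exact link text is the simplest thing that works for one page; for other disambiguation pages the exclusion rule would have to be worked out separately.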

