JSoup

Question

I have the following content:

</div>
  <p>
    <a href="https://urlIwant.com" data-wpel-link="internal">
      <span class="image-holder" style="padding-bottom:149.92679355783%;">
        <img loading="lazy" src="https://urlIwant.com" width="683" height="1024" class="alignnone size-full wp-image-200816" />
      </span>
    </a>
  </p>
  <p>
    <span id="more-20000"></span>
  </p>
  <p>
    <a href="https://urlIwant.com" data-wpel-link="internal">
      <span class="image-holder" style="padding-bottom:149.92679355783%;">
        <img loading="lazy" src="https://urlIwant.com" width="683" height="1024" class="alignnone size-full wp-image-200833" />
      </span>
    </a>
  </p>
  <p>
    <a href="https://urlIwant.com" data-wpel-link="internal">
      <span class="image-holder" style="padding-bottom:145.71428571429%;">
        <img loading="lazy" src="https://urlIwant.com" width="700" height="1020" class="alignnone size-medium wp-image-200834" sizes="(max-width: 700px) 100vw, 700px" />
      </span>
    </a>
  </p>
  <p>
    <a href="https://urlIwant.com" data-wpel-link="internal">
      <span class="image-holder" style="padding-bottom:143.42857142857%;">
        <img loading="lazy" src="https://urlIwant.com" width="700" height="1004" class="alignnone size-medium wp-image-200835" sizes="(max-width: 700px) 100vw, 700px" />
      </span>
    </a>
  </p>
</div>

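How do I extract all the URLs from the href of links that sit inside a paragraph tag and contain a span with the class "image-holder"?

I don't know how to add the span class to the selector.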

try {
    Document doc = Jsoup.connect("https://urltoextractfrom.com").get();
    Elements selections = doc.select("p a[href]");
    for (Element e : selections) {
        System.out.println(e);
    }
} catch (Exception e) {
    e.printStackTrace();
}

Answer 1

If anyone has a better answer, please post it, but I finally got this working after 3 hours of trying...

try {
    Document doc = Jsoup.connect("https://urltoextractfrom.com").get();
    // Start from every link with an href that sits inside a paragraph.
    Elements selections = doc.select("p a[href]");
    for (Element e : selections) {
        // Keep the link only if it contains a span whose class is exactly "image-holder".
        Elements elements2 = e.select("span[class=image-holder]");
        if (!elements2.isEmpty()) {
            System.out.println(e.attr("href"));
        }
    }
} catch (Exception e) {
    e.printStackTrace();
}

Answer 2

If I understand correctly what you want to extract, you can use this selector:

p a:has(span.image-holder)

This finds every a element that is a descendant of a p element and contains a span with the class image-holder.

So in code:

Document document = Jsoup.parse(html);
Elements links = document.select("p a:has(span.image-holder)");
List<String> urls = links.eachAttr("href");
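
For anyone who wants to try the selector without fetching a page, here is a minimal self-contained sketch. The class name ImageLinkExtractor, the method extractImageLinks, and the example.com URLs are made up for illustration; only Jsoup.parse, select, and eachAttr come from jsoup itself. The sample markup includes a plain text link and a link outside any paragraph, neither of which should match.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

import java.util.List;

public class ImageLinkExtractor {

    // Hypothetical helper: returns the href of every <a> that is inside a <p>
    // and contains a <span class="image-holder">.
    public static List<String> extractImageLinks(String html) {
        Document document = Jsoup.parse(html);
        Elements links = document.select("p a:has(span.image-holder)");
        return links.eachAttr("href");
    }

    public static void main(String[] args) {
        // Made-up sample markup, loosely modelled on the question's structure.
        String html =
            "<div>"
            + "<p><a href=\"https://example.com/one.jpg\">"
            + "<span class=\"image-holder\"><img src=\"https://example.com/one.jpg\"></span></a></p>"
            + "<p><span id=\"more-20000\"></span></p>"
            // A link in a paragraph but without the image-holder span: skipped.
            + "<p><a href=\"https://example.com/plain\">plain text link</a></p>"
            // A link with the span but outside any paragraph: skipped.
            + "<a href=\"https://example.com/outside.jpg\"><span class=\"image-holder\"></span></a>"
            + "</div>";

        System.out.println(extractImageLinks(html)); // prints [https://example.com/one.jpg]
    }
}
```

To run it against the live page instead, swap Jsoup.parse(html) for Jsoup.connect(url).get() as in the question.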

You can use the try.jsoup REPL to iterate on selectors quickly: https://try.jsoup.org/~wvd2VHaJtnr10qEiLS9g_-E6UA8

(If there is content you don't want selected, you can add an example of it to your question.)

JSoup - How to extract only the href in paragraph

java