JSoup - How to extract only the href in paragraph
I have the following content:
</div>
<p>
<a href="https://urlIwant.com" data-wpel-link="internal">
<span class="image-holder" style="padding-bottom:149.92679355783%;">
<img loading="lazy" src="https://urlIwant.com" width="683" height="1024" class="alignnone size-full wp-image-200816" />
</span>
</a>
</p>
<p>
<span id="more-20000"></span>
</p>
<p>
<a href="https://urlIwant.com" data-wpel-link="internal">
<span class="image-holder" style="padding-bottom:149.92679355783%;">
<img loading="lazy" src="https://urlIwant.com" width="683" height="1024" class="alignnone size-full wp-image-200833" />
</span>
</a>
</p>
<p>
<a href="https://urlIwant.com" data-wpel-link="internal">
<span class="image-holder" style="padding-bottom:145.71428571429%;">
<img loading="lazy" src="https://urlIwant.com" width="700" height="1020" class="alignnone size-medium wp-image-200834" sizes="(max-width: 700px) 100vw, 700px" />
</span>
</a>
</p>
<p>
<a href="https://urlIwant.com" data-wpel-link="internal">
<span class="image-holder" style="padding-bottom:143.42857142857%;">
<img loading="lazy" src="https://urlIwant.com" width="700" height="1004" class="alignnone size-medium wp-image-200835" sizes="(max-width: 700px) 100vw, 700px" />
</span>
</a>
</p>
</div>
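How do I extract all the URLs that sit inside a paragraph tag, have an href, and contain a span with the class "image-holder"?
I can't figure out how to add the span class to the selector.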
try {
    // Fetch and parse the page
    Document doc = Jsoup.connect("https://urltoextractfrom.com").get();
    // Every <a> with an href attribute that is a descendant of a <p>
    Elements selections = doc.select("p a[href]");
    for (Element e : selections) {
        System.out.println(e);
    }
} catch (Exception e) {
    e.printStackTrace();
}
If anyone has a better answer, please post it, but I got this working after trying for 3 hours...
try {
    Document doc = Jsoup.connect("https://urltoextractfrom.com").get();
    Elements selections = doc.select("p a[href]");
    for (Element e : selections) {
        // Spans inside this link whose class attribute is exactly "image-holder"
        Elements elements2 = e.select("span[class=image-holder]");
        // attr() returns "" when nothing matched, so this is true only if such a span exists
        if (elements2.attr("class").equals("image-holder")) {
            System.out.println(e.attr("href"));
        }
    }
} catch (Exception e) {
    e.printStackTrace();
}
If I've understood correctly what you want to extract, you can use this selector:
p a:has(span.image-holder)
This finds all a elements that are descendants of a p element and that contain a span with the class image-holder set.
So in code:
Document document = Jsoup.parse(html);
Elements links = document.select("p a:has(span.image-holder)");
List<String> urls = links.eachAttr("href");
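For anyone who wants to try this end to end, here is a minimal self-contained sketch of the same selector. The class name ImageHolderLinks, the method extractHrefs, and the example.com URLs are made up for illustration; only the selector and the jsoup calls come from the snippet above. It assumes a jsoup version that provides Elements.eachAttr.

```java
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

// Hypothetical helper class, not part of jsoup or of the original question.
public class ImageHolderLinks {

    // Returns the href of every <a> that is inside a <p> and contains
    // a <span> carrying the class "image-holder".
    static List<String> extractHrefs(String html) {
        Document document = Jsoup.parse(html);
        Elements links = document.select("p a:has(span.image-holder)");
        return links.eachAttr("href");
    }

    public static void main(String[] args) {
        // Small stand-in for the markup in the question (URLs are placeholders).
        String html = "<div>"
                + "<p><a href=\"https://example.com/one\">"
                + "<span class=\"image-holder\"><img src=\"one.png\"></span></a></p>"
                + "<p><span id=\"more-20000\"></span></p>"
                + "<p><a href=\"https://example.com/plain\">no span in this link</a></p>"
                + "</div>";

        // Only the first link wraps a span.image-holder, so only it should be listed.
        System.out.println(extractHrefs(html));
    }
}
```

Because :has matches on the class name rather than the whole class attribute, this should also pick up spans that carry additional classes, which span[class=image-holder] would miss.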
You can use the try.jsoup REPL to iterate on selectors quickly.
https://try.jsoup.org/~wvd2VHaJtnr10qEiLS9g_-E6UA8
(If there is content you don't want selected, you can give an example of it in your question.)
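As a purely hypothetical illustration of narrowing the match: if the page also had links you wanted to skip, one option might be to add an attribute filter to the selector. The sketch below assumes the unwanted links lack data-wpel-link="internal" (an attribute that does appear in the posted HTML); that assumption, the class name InternalImageLinks, and the sample URLs are mine, not something the question states.

```java
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Hypothetical helper class used only for this illustration.
public class InternalImageLinks {

    // Same idea as before, but the <a> must also have data-wpel-link="internal".
    static List<String> extractInternalHrefs(String html) {
        Document document = Jsoup.parse(html);
        return document
                .select("p a[data-wpel-link=internal]:has(span.image-holder)")
                .eachAttr("href");
    }

    public static void main(String[] args) {
        // Placeholder markup: one internal link and one external link, both wrapping the span.
        String html = "<div>"
                + "<p><a href=\"https://example.com/keep\" data-wpel-link=\"internal\">"
                + "<span class=\"image-holder\"></span></a></p>"
                + "<p><a href=\"https://example.org/skip\" data-wpel-link=\"external\">"
                + "<span class=\"image-holder\"></span></a></p>"
                + "</div>";

        // Only the link marked internal should be listed.
        System.out.println(extractInternalHrefs(html));
    }
}
```

Whether an attribute filter, a :not(...) clause, or a more specific parent selector fits best depends on what the unwanted content actually looks like.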