JSoup not able to get links from HTML
I am trying to get links from a website's HTML, but I am not able to get them using Jsoup.
This is the HTML:
<div class="anime_muti_link">
<ul>
<li><div class="doamin">Domain</div><div class="link">Link</div></li>
<li class="anime">
<a href="#" class="active" rel="1" data-video="example.com" ><div class="server m1">Server m1</div><span>Watch This Link</span></a>
</li>
<li class="anime">
<a href="#" rel="1" data-video="example.com" ><div class="server m1">Server m2</div><span>Watch This Link</span></a>
</li>
<li class="xstreamcdn">
<a href="#" rel="29" data-video="example.com">Xstreamcdn</div><span>Watch This Link</span></a>
</li>
<li class="mixdrop">
<a href="#" rel="7" data-video="example.com"><div class="server mixdrop">Mixdrop</div><span>Watch This Link</span></a>
</li>
<li class="streamsb">
<a href="#" rel="13" data-video="example.com">StreamSB</div><span>Watch This Link</span></a>
</li>
<li class="doodstream">
<a href="#" rel="14" data-video="example.com">Doodstream</div><span>Watch This Link</span></a>
</li>
</ul>
</div>
This is the Android code I wrote, but it does not seem to work:
try {
    Document doc = Jsoup.connect(URL).get();
    Elements content = doc.getElementsByClass("anime_muti_link");
    Elements links = content.select("a");
    String[] urls = new String[links.size()];
    for (int i = 0; i < links.size(); i++) {
        urls[i] = links.get(i).attr("data-video");
        if (!urls[i].startsWith("https://")) {
            urls[i] = "https:" + urls[i];
        }
    }
    arrayList.addAll(Arrays.asList(urls));
    Log.d("CALLING_URL", "Links: " + Arrays.toString(urls));
} catch (IOException e) {
    e.getMessage();
}
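Can someone please help me with this? Thanks.
Edit: Basically, I am trying to get these 6 links and add them to my list so I can use them in the app.
Edit 2:
I found another piece of HTML that looks better: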
<div class="heading-servers">
<span><i class="fa fa-signal"></i> Servers</span>
<ul class="servers">
<li data-vs="https://example.com" class="server server-active" style="display: block;" onclick="return loadIframe('ifrm', this.getAttribute('data-vs'));">Netu</li>
<li data-vs="https://example.com" class="server" style="display: block;" onclick="return loadIframe('ifrm', this.getAttribute('data-vs'));">VideoVard</li>
<li data-vs="https://example.com" class="server" style="display: block;" onclick="return loadIframe('ifrm', this.getAttribute('data-vs'));">Doodstream</li>
<li data-vs="https://example.com" class="server" style="display: block;" onclick="return loadIframe('ifrm', this.getAttribute('data-vs'));">Okstream</li>
</ul>
</div>
As you can see, this li definition contains a closing </div> tag with no matching opening <div>:
<li class="xstreamcdn">
<a href="#" rel="29" data-video="example.com">Xstreamcdn</div><span>Watch This Link</span></a>
</li>
As a result, the variable content, which holds the HTML fragment with class anime_muti_link, looks like this:
<div class="anime_muti_link">
<ul>
<li>
<div class="doamin">
Domain
</div>
<div class="link">
Link
</div></li>
<li class="anime"> <a href="#" class="active" rel="1" data-video="example.com">
<div class="server m1">
Server m1
</div><span>Watch This Link</span></a> </li>
<li class="anime"> <a href="#" rel="1" data-video="example.com">
<div class="server m1">
Server m2
</div><span>Watch This Link</span></a> </li>
<li class="xstreamcdn"> <a href="#" rel="29" data-video="example.com">Xstreamcdn</a></li>
</ul>
</div>
Tidying the HTML first gives a similar result. I used the code from one of my previous answers:
Tidy tidy = new Tidy();
tidy.setXHTML(true);
tidy.setIndentContent(true);
tidy.setPrintBodyOnly(true);
tidy.setInputEncoding("UTF-8");
tidy.setOutputEncoding("UTF-8");
tidy.setSmartIndent(true);
tidy.setShowWarnings(false);
tidy.setQuiet(true);
tidy.setTidyMark(false);
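// Use UTF-8 explicitly so the bytes match the encodings configured above
// (requires java.nio.charset.StandardCharsets).
org.w3c.dom.Document htmlDOM = tidy.parseDOM(new ByteArrayInputStream(html.getBytes(StandardCharsets.UTF_8)), null);
ByteArrayOutputStream out = new ByteArrayOutputStream();
tidy.pprint(htmlDOM, out);
String tidiedHtml = new String(out.toByteArray(), StandardCharsets.UTF_8);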
// System.out.println(tidiedHtml);
Document document = Jsoup.parse(tidiedHtml);
Elements content = document.getElementsByClass("anime_muti_link");
System.out.println(content);
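The stray </div> is treated as closing the enclosing div with class anime_muti_link, so the anchors that follow end up outside it. That is why you only find three anchors.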
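To spot this kind of stray closing tag before parsing, a rough check can count the closing tags that have no opening tag left to match. The sketch below is a regex-based heuristic, not an HTML parser: TagBalance and countUnmatchedClosings are hypothetical names, and it ignores comments and script content.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagBalance {

    /**
     * Counts closing tags of the given name that have no opening tag left to match.
     * Heuristic only: comments, scripts and CDATA sections are not handled.
     */
    static int countUnmatchedClosings(String html, String tagName) {
        // Group 1 is "/" for a closing tag; \b stops "div" from matching longer tag names.
        Pattern tag = Pattern.compile(
                "<(/?)" + Pattern.quote(tagName) + "\\b[^>]*>",
                Pattern.CASE_INSENSITIVE);
        Matcher m = tag.matcher(html);
        int depth = 0;
        int unmatched = 0;
        while (m.find()) {
            if (m.group(1).isEmpty()) {
                depth++;          // opening tag
            } else if (depth > 0) {
                depth--;          // closing tag that matches an open one
            } else {
                unmatched++;      // closing tag with nothing left to close
            }
        }
        return unmatched;
    }

    public static void main(String[] args) {
        // Shortened version of the HTML from the question, with one stray </div>.
        String html = "<div class=\"anime_muti_link\"><ul>"
                + "<li class=\"anime\"><a href=\"#\" data-video=\"example.com\">"
                + "<div class=\"server m1\">Server m1</div></a></li>"
                + "<li class=\"xstreamcdn\"><a href=\"#\" data-video=\"example.com\">"
                + "Xstreamcdn</div></a></li>"
                + "</ul></div>";
        // The stray tag uses up the outer div's opening tag, so the final </div> is
        // the one left without a match.
        System.out.println(countUnmatchedClosings(html, "div")); // prints 1
    }
}
```

A non-zero result only says that the fragment is unbalanced; it does not say which tag is the stray one.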
Please try correcting your HTML, or select the anchor tags at the document level:
Document document = Jsoup.parse(html);
// Elements content = document.getElementsByClass("anime_muti_link");
// System.out.println(content);
Elements links = document.select("a");
String[] urls = new String[links.size()];
for (int i = 0; i < links.size(); i++) {
    urls[i] = links.get(i).attr("data-video");
    if (!urls[i].startsWith("https://")) {
        urls[i] = "https://" + urls[i];
    }
}
System.out.println(Arrays.asList(urls));
If the results include unwanted links, you can try narrowing the selector, for example:
document.select(".anime_muti_link a")
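If that does not work (with the malformed HTML above it will again match only the anchors inside the prematurely closed div), another possible alternative is to select the anchor elements that have a data-video attribute, a[data-video]: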
Document document = Jsoup.parse(html);
Elements videoLinks = document.select("a[data-video]");
String[] urls = new String[videoLinks.size()];
for (int i = 0; i < videoLinks.size(); i++) {
    urls[i] = videoLinks.get(i).attr("data-video");
    if (!urls[i].startsWith("https://")) {
        urls[i] = "https://" + urls[i];
    }
}
System.out.println(Arrays.asList(urls));
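One detail to watch in the prefix check: the question's code prepends "https:" while the snippets above prepend "https://". Which one is right depends on whether data-video holds a protocol-relative value such as //example.com or a bare host such as example.com. A minimal sketch that handles both, assuming those are the only forms that occur (VideoUrls, toHttps and normalizeAll are hypothetical names, not part of Jsoup):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class VideoUrls {

    /** Returns the value with an explicit scheme; an existing http/https scheme is kept. */
    static String toHttps(String raw) {
        String value = raw.trim();
        if (value.startsWith("https://") || value.startsWith("http://")) {
            return value;              // already absolute
        }
        if (value.startsWith("//")) {
            return "https:" + value;   // protocol-relative: //host/path
        }
        return "https://" + value;     // bare host: example.com/path
    }

    /** Normalizes every non-blank attribute value, skipping null or empty ones. */
    static List<String> normalizeAll(List<String> rawValues) {
        List<String> urls = new ArrayList<>();
        for (String raw : rawValues) {
            if (raw != null && !raw.trim().isEmpty()) {
                urls.add(toHttps(raw));
            }
        }
        return urls;
    }

    public static void main(String[] args) {
        List<String> raw = Arrays.asList("example.com", "//example.com/v", "http://example.com", "");
        System.out.println(normalizeAll(raw));
    }
}
```

Inside the loops above, the if block could then be replaced by a single call such as urls[i] = VideoUrls.toHttps(urls[i]).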
With your new test case, you can obtain the desired information with very similar code:
String html = "<div class=\"heading-servers\">\n" +
" <span><i class=\"fa fa-signal\"></i> Servers</span>\n" +
" <ul class=\"servers\">\n" +
" <li data-vs=\"https://example.com\" class=\"server server-active\" style=\"display: block;\" onclick=\"return loadIframe('ifrm', this.getAttribute('data-vs'));\">Netu</li>\n" +
" <li data-vs=\"https://example.com\" class=\"server\" style=\"display: block;\" onclick=\"return loadIframe('ifrm', this.getAttribute('data-vs'));\">VideoVard</li>\n" +
" <li data-vs=\"https://example.com\" class=\"server\" style=\"display: block;\" onclick=\"return loadIframe('ifrm', this.getAttribute('data-vs'));\">Doodstream</li>\n" +
" <li data-vs=\"https://example.com\" class=\"server\" style=\"display: block;\" onclick=\"return loadIframe('ifrm', this.getAttribute('data-vs'));\">Okstream</li>\n" +
" </ul>\n" +
" </div>";
Document document = Jsoup.parse(html);
Elements videoLinks = document.select("div.heading-servers ul.servers li.server");
String[] urls = new String[videoLinks.size()];
for (int i = 0; i < videoLinks.size(); i++) {
    urls[i] = videoLinks.get(i).attr("data-vs");
    if (!urls[i].startsWith("https://")) {
        urls[i] = "https://" + urls[i];
    }
}
System.out.println(Arrays.asList(urls));
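The most important part is the definition of the selector applied to the parsed document, in our case div.heading-servers ul.servers li.server.
I provided a selector with many fragments, but depending on the actual HTML it could be simplified to ul.servers li.server or even li.server.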