How can I extract all links (href) in an HTML file?

Question

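I'm trying to extract all the links from an HTML file using Java.

The pattern seems to be <a href = "Name">. I want to get the URL so that I can access the desired web page.

Could you help me with this (string.contains? string.indexof?)?

Thanks.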

Answer 1

The most basic approach is to use regular-expression matching.

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    String html = "YOUR HTML";
    // Backslashes must be doubled inside a Java string literal: \\s, not \s.
    String regex = "<a href\\s?=\\s?\"([^\"]+)\">";
    Pattern pattern = Pattern.compile(regex);
    Matcher matcher = pattern.matcher(html);
    // find() with no argument resumes right after the previous match.
    while (matcher.find()) {
        String wholething = matcher.group(); // includes "<a href" and ">"
        String link = matcher.group(1); // just the link
        // do something with wholething or link.
    }
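
Note that the pattern above only matches when href is the first attribute, is written in lowercase, and uses double quotes. A slightly more tolerant variant might look like the sketch below. The class name HrefRegex and the helper extractLinks are made up for illustration, and a regex like this is still not a real HTML parser (it will be fooled by comments, scripts, or unquoted attribute values).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HrefRegex {
    // Assumption: href values are quoted with either " or '.
    // <a\b        the tag name "a" (so <abbr> does not match)
    // [^>]*?\s    any other attributes, then whitespace before href
    // (["'])      the opening quote, captured as group 1
    // (.*?)\1     the link (group 2), up to the same closing quote
    private static final Pattern HREF = Pattern.compile(
            "<a\\b[^>]*?\\shref\\s*=\\s*([\"'])(.*?)\\1",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Hypothetical helper: returns every href value found in <a> tags.
    public static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(2));
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<A class='x' HREF='a.html'>A</A> "
                + "<a href = \"http://example.com/b\">B</a>";
        System.out.println(extractLinks(html)); // prints [a.html, http://example.com/b]
    }
}
```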

Alternatively, you could use something like a Document. I'm not very familiar with that approach.
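
For what it's worth, here is a rough sketch of what a parser-based version could look like. It is not clear which Document API is meant, so this is only one guess: it uses the HTML parser that ships with the JDK in javax.swing.text.html and a callback that records the href of each <a> start tag. The class name HrefParser and the helper extractLinks are made up for illustration.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class HrefParser {
    // Hypothetical helper: parses the HTML and collects the href of every <a> tag.
    public static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.A) {
                    Object href = attrs.getAttribute(HTML.Attribute.HREF);
                    if (href != null) {
                        links.add(href.toString());
                    }
                }
            }
        };
        try (Reader reader = new StringReader(html)) {
            // Third argument: ignore any charset declared inside the document.
            new ParserDelegator().parse(reader, callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<html><body><a href=\"a.html\">A</a> "
                + "<A HREF='http://example.com/b'>B</A></body></html>";
        System.out.println(extractLinks(html));
    }
}
```

Because the parser handles quoting, attribute order, and letter case itself, there is no pattern to maintain. To read from a file instead of a String, a FileReader could be passed to parse in place of the StringReader.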
