Finding links from the website using threads

Question

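I am developing a program that fetches all links from a website and searches them for an input word. It then visits each link and searches again, and so on. The program goes 3 levels deep (which is why n is compared with 3). The code below does this recursively and seems to work fine.

However, I would like to speed this up by using threads. How can I implement that? As far as I know, I could probably use fork/join.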

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public static void getLinks(String url, Set<String> urls, String word, int n) {
    if (url.contains(word)) {
        System.out.println("Found: " + url);
    }

    // Skip URLs that have already been visited
    if (urls.contains(url)) {
        return;
    }
    urls.add(url);

    // Follow links only while the depth is below 3
    if (n < 3) {
        try {
            Document doc = Jsoup.connect(url).get();
            Elements elements = doc.select("a[href]");
            for (Element element : elements) {
                System.out.println(element.absUrl("href"));
                getLinks(element.absUrl("href"), urls, word, n + 1);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

public static void main(String[] args) {
    Set<String> links = new HashSet<>();
    String word = "root";
    getLinks("https://example.com", links, word, 0);
}

PS: In the final version of the program, the links that match the input word will be displayed in a GUI.

Answer 1

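You can use a work queue to which you submit runnables for execution. Whenever you discover a link, you submit a task that crawls the page it points to.

Essentially, you have producers of work and consumers of work.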

https://www.baeldung.com/java-blocking-queue
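The producer/consumer idea above could look roughly like the following. This is only a minimal sketch under some assumptions: the class name `QueueCrawler`, the `Job` helper, the `STOP` sentinel and the `fetchLinks` parameter are hypothetical names introduced for illustration, and page fetching is passed in as a function so that the example runs without network access. Each consumer thread takes a URL from a `BlockingQueue` and also acts as a producer by enqueueing the links it finds; a counter of outstanding jobs decides when to stop.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

class QueueCrawler {
    /** One unit of work: a URL and its distance from the start page (hypothetical helper). */
    private static final class Job {
        final String url;
        final int depth;
        Job(String url, int depth) { this.url = url; this.depth = depth; }
    }

    /** Sentinel ("poison pill") that tells a consumer thread to exit. */
    private static final Job STOP = new Job("", -1);

    /** Crawls from start down to maxDepth and returns the visited URLs that contain word. */
    public static Set<String> crawl(String start, String word, int maxDepth, int workers,
                                    Function<String, List<String>> fetchLinks) {
        BlockingQueue<Job> queue = new LinkedBlockingQueue<>();
        Set<String> visited = ConcurrentHashMap.newKeySet();
        Set<String> found = ConcurrentHashMap.newKeySet();
        AtomicInteger pending = new AtomicInteger(1); // jobs queued or being processed
        visited.add(start);
        queue.add(new Job(start, 0));

        Runnable consumer = () -> {
            try {
                for (Job job = queue.take(); job != STOP; job = queue.take()) {
                    try {
                        if (job.url.contains(word)) {
                            found.add(job.url);
                        }
                        if (job.depth < maxDepth) {
                            for (String link : fetchLinks.apply(job.url)) {
                                if (visited.add(link)) { // false if already seen
                                    pending.incrementAndGet();
                                    queue.add(new Job(link, job.depth + 1));
                                }
                            }
                        }
                    } finally {
                        // The last finished job releases every consumer
                        if (pending.decrementAndGet() == 0) {
                            for (int i = 0; i < workers; i++) {
                                queue.add(STOP);
                            }
                        }
                    }
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };

        List<Thread> threads = new ArrayList<>();
        for (int i = 0; i < workers; i++) {
            Thread t = new Thread(consumer);
            t.start();
            threads.add(t);
        }
        try {
            for (Thread t : threads) {
                t.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return new HashSet<>(found);
    }

    public static void main(String[] args) {
        // In-memory stand-in for real pages so the sketch runs offline
        Map<String, List<String>> pages = new HashMap<>();
        pages.put("https://example.com",
                Arrays.asList("https://example.com/root", "https://example.com/docs"));
        pages.put("https://example.com/docs",
                Arrays.asList("https://example.com/docs/root-guide"));
        Set<String> found = crawl("https://example.com", "root", 3, 4,
                url -> pages.getOrDefault(url, Collections.emptyList()));
        found.forEach(url -> System.out.println("Found: " + url));
    }
}
```

In real use, `fetchLinks` could wrap the Jsoup calls from the question, returning `element.absUrl("href")` for each `a[href]` element and an empty list when an `IOException` occurs.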

Answer 2

The simple approach is to submit getLinks to a thread pool while iterating over the Elements:

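    import java.io.IOException;
    import java.util.*;
    import java.util.concurrent.*;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Element;

    static ExecutorService executorService = Executors.newCachedThreadPool();
    // Worker threads add tasks to this queue, so it must be thread-safe
    static Queue<Callable<Object>> todo = new ConcurrentLinkedQueue<>();

    public static void main(String[] args) throws InterruptedException {
        // Thread-safe set, because tasks on several threads add to it
        Set<String> links = ConcurrentHashMap.newKeySet();
        getLinks("https://example.com", links, "root", 0);
        // Running a batch can queue further tasks, so repeat until nothing is left
        while (!todo.isEmpty()) {
            List<Callable<Object>> batch = new ArrayList<>();
            for (Callable<Object> task = todo.poll(); task != null; task = todo.poll()) {
                batch.add(task);
            }
            // Wait until all tasks are complete
            // Or use invokeAll(collection, timeout, unit) if you want to have a maximum wait time
            executorService.invokeAll(batch);
        }
        executorService.shutdown();
    }

    public static void getLinks(String url, Set<String> urls, String word, int n) {
        if (url.contains(word)) {
            System.out.println("Found: " + url);
        }
        // add() returns false if the URL was already visited
        if (!urls.add(url) || n >= 3) {
            return;
        }
        try {
            for (Element element : Jsoup.connect(url).get().select("a[href]")) {
                String link = element.absUrl("href");
                todo.add(Executors.callable(() -> getLinks(link, urls, word, n + 1)));
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }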
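
Since the question mentions fork/join, here is a rough sketch of how the same recursion might be expressed with a `ForkJoinPool` and a `RecursiveAction`. It is an illustration under assumptions, not a definitive implementation: the names `ForkJoinCrawler`, `CrawlTask` and `fetchLinks` are hypothetical, and page fetching is again passed in as a function so that the example runs without network access. Each task checks its URL, creates one child task per unvisited link, and `invokeAll` forks the children and waits for them.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;
import java.util.function.Function;

class ForkJoinCrawler {
    private final String word;
    private final int maxDepth;
    private final Function<String, List<String>> fetchLinks;
    // Shared between tasks running on different threads, hence concurrent sets
    private final Set<String> visited = ConcurrentHashMap.newKeySet();
    private final Set<String> found = ConcurrentHashMap.newKeySet();

    public ForkJoinCrawler(String word, int maxDepth,
                           Function<String, List<String>> fetchLinks) {
        this.word = word;
        this.maxDepth = maxDepth;
        this.fetchLinks = fetchLinks;
    }

    /** Crawls from start and returns the visited URLs that contain the word. */
    public Set<String> crawl(String start) {
        visited.add(start);
        ForkJoinPool pool = new ForkJoinPool();
        try {
            pool.invoke(new CrawlTask(start, 0)); // blocks until the whole task tree is done
        } finally {
            pool.shutdown();
        }
        return new HashSet<>(found);
    }

    /** One page visit; child pages become subtasks. */
    private final class CrawlTask extends RecursiveAction {
        private final String url;
        private final int depth;

        CrawlTask(String url, int depth) {
            this.url = url;
            this.depth = depth;
        }

        @Override
        protected void compute() {
            if (url.contains(word)) {
                found.add(url);
            }
            if (depth >= maxDepth) {
                return;
            }
            List<CrawlTask> children = new ArrayList<>();
            for (String link : fetchLinks.apply(url)) {
                if (visited.add(link)) { // false if already seen
                    children.add(new CrawlTask(link, depth + 1));
                }
            }
            invokeAll(children); // fork all children and wait for them
        }
    }

    public static void main(String[] args) {
        // In-memory stand-in for real pages so the sketch runs offline
        Map<String, List<String>> pages = new HashMap<>();
        pages.put("https://example.com",
                Arrays.asList("https://example.com/root", "https://example.com/docs"));
        pages.put("https://example.com/docs",
                Arrays.asList("https://example.com/docs/root-guide"));
        ForkJoinCrawler crawler = new ForkJoinCrawler("root", 3,
                url -> pages.getOrDefault(url, Collections.emptyList()));
        crawler.crawl("https://example.com")
                .forEach(url -> System.out.println("Found: " + url));
    }
}
```

One caveat on this design: fork/join is aimed mainly at CPU-bound work, and tasks that block on network I/O can keep its worker threads idle, so a plain thread pool may well suit a crawler better.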


java

multithreading

jsoup