Use of multithreading for downloading in Java

Question

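I am trying to download, in parallel, the HTML source of websites whose URLs are stored in a database (about 3 million entries).
Clearly I should use multithreading, but I am having trouble working out how to use it in Java.

Here is how I did it before, without multithreading: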

final Connection c = dbConnect(); // register the JDBC driver and establish the connection
checkRequiredDbAndTables();       // check that the database and the necessary tables exist

// get the list of URLs from the database
String sql = "select id, website_url, category_id from list_of_websites";
try (PreparedStatement ps = c.prepareStatement(sql);
     ResultSet rs = ps.executeQuery()) {

    while (rs.next()) {
        // column numbering in ResultSet starts at 1
        final long id = rs.getLong(1);        // website id
        final String url = rs.getString(2);   // website url

        System.out.println("Category: " + rs.getString(3) + " " + id + " " + url);

        if (isValidURL(url) && connectionOK(url)) {
            // URL syntax and connection have been checked
            String htmlInPage = downloadHTML(url);
            if (!htmlInPage.isEmpty()) {
                // add the result to the database
                insertDataToDb(c, id, htmlInPage);
            }
        }
    }
} catch (SQLException e) {
    e.printStackTrace();
} finally {
    closeConnection(c); // close the database connection
}

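The function downloadHTML uses the JSoup library to do the main work.

My task feels a bit like the "producer-consumer problem". I imagine it could be modelled like this: there is a buffer holding N links; several workers take links from it and download the HTML; and one worker's job is to load new URLs from the database into the buffer whenever it becomes empty.
But I have no idea how to actually do this. I have heard of Threads and of ExecutorService, which provides thread pools, but I am really confused.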

Answer 1

You may want to use a thread pool with a fixed number of threads. Your program first creates the thread pool, then reads URLs from the database. Each time a URL is read, the program starts a new task to download its content.

Your program can also maintain a queue. When a task finishes downloading the HTML, it pushes the URL and the result together into the queue. Once the main thread has finished reading URLs and starting tasks, it waits on the queue. Whenever the queue holds a response, the main thread takes it out and writes it to the database. The main thread can count how many responses it has received; when the count reaches the number of URLs, all tasks are finished.

Your program can define a class that stores a response together with its URL, for example:

// Pairs a URL with the HTML downloaded from it.
class Response {
    public final String url;
    public final String result;
    public Response(String url, String result) { this.url = url; this.result = result; }
}
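
The design described above (a fixed-size pool, a queue of responses, and a main thread that counts them) could be sketched roughly as follows. This is only an illustration under some assumptions: the names PoolWithResultQueue, downloadAll and downloader are hypothetical, downloader stands in for the asker's downloadHTML, the queue type (LinkedBlockingQueue) is my choice since the answer does not name one, and a small Response class like the one above is nested so the example compiles on its own. The real program would write each response to the database instead of collecting it in a list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Function;

public class PoolWithResultQueue {

    /** Pairs a URL with the HTML downloaded from it. */
    static final class Response {
        final String url;
        final String result;
        Response(String url, String result) { this.url = url; this.result = result; }
    }

    /**
     * Downloads every URL on a fixed-size pool and returns the responses in
     * completion order. `downloader` is a hypothetical stand-in for downloadHTML.
     */
    static List<Response> downloadAll(List<String> urls, int threads,
                                      Function<String, String> downloader)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        BlockingQueue<Response> responses = new LinkedBlockingQueue<>();
        try {
            // One task per URL; each task pushes exactly one Response.
            for (String url : urls) {
                pool.execute(() -> {
                    String html;
                    try {
                        html = downloader.apply(url);
                    } catch (RuntimeException e) {
                        html = ""; // report a failure as an empty result so the count still matches
                    }
                    responses.add(new Response(url, html));
                });
            }
            // Main thread: take one response per submitted URL, then stop.
            List<Response> collected = new ArrayList<>();
            for (int received = 0; received < urls.size(); received++) {
                Response r = responses.take(); // blocks until a task has finished
                collected.add(r);              // the real program would call insertDataToDb here
            }
            return collected;
        } finally {
            pool.shutdown(); // no more tasks; idle worker threads exit
        }
    }

    public static void main(String[] args) throws InterruptedException {
        List<String> urls = List.of("http://a.example", "http://b.example", "http://c.example");
        List<Response> done = downloadAll(urls, 2, url -> "<html>" + url + "</html>");
        for (Response r : done) {
            System.out.println(r.url + " -> " + r.result.length() + " chars");
        }
    }
}
```

Because only the main thread touches the database in this arrangement, the single JDBC connection from the question does not have to be shared between threads.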

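If you have any more questions about implementing or understanding this (I may not have explained it clearly; it is 00:40 here and I will probably go to bed soon), please leave a comment. If you want code, please leave a comment as well.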

Answer 2

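Main thread:

Start X "downloading" threads
Run the query shown in the question. For each record:
- Add the data from the query to an ArrayBlockingQueue
Add an end-of-data marker to the queue
Wait for the threads to stop (optional)
Return from main

Downloading thread:

Get data from the queue. While it is not the end-of-data marker:
- Download the HTML
- Insert the HTML into the database
- Get the next item from the queue
Put the end-of-data marker back into the queue for the other threads to find
Exit the thread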
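
The two outlines above might translate into something like the following sketch. It rests on several assumptions: the names PoisonPillDownloader, run and downloader are hypothetical, downloader stands in for downloadHTML, a ConcurrentHashMap stands in for the database, and the queue capacity of 100 is arbitrary. The end-of-data marker is a dedicated String instance compared by identity, so no real URL can be mistaken for it.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

public class PoisonPillDownloader {

    // Unique end-of-data marker; compared with != (identity), never with equals().
    private static final String END_OF_DATA = new String("END_OF_DATA");

    static Map<String, String> run(List<String> urls, int threads,
                                   Function<String, String> downloader)
            throws InterruptedException {
        // Bounded queue: put() blocks while it is full, so the reader cannot run far ahead.
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(100);
        Map<String, String> store = new ConcurrentHashMap<>(); // stand-in for the database

        // Start the "downloading" threads.
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                try {
                    String url = queue.take();
                    while (url != END_OF_DATA) {
                        try {
                            store.put(url, downloader.apply(url)); // download, then "insert"
                        } catch (RuntimeException e) {
                            // skip a failed URL so this worker keeps running
                        }
                        url = queue.take();
                    }
                    queue.put(END_OF_DATA); // put the marker back for the other threads to find
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            workers[i].start();
        }

        // Main thread: one put per record, then the end-of-data marker.
        for (String url : urls) {
            queue.put(url);
        }
        queue.put(END_OF_DATA);

        // Wait for the threads to stop.
        for (Thread t : workers) {
            t.join();
        }
        return store;
    }

    public static void main(String[] args) throws InterruptedException {
        Map<String, String> pages =
            run(List.of("http://a.example", "http://b.example"), 2, url -> "<html>" + url + "</html>");
        System.out.println(pages.size() + " pages stored");
    }
}
```

In this variant the workers write to the store themselves, so with a real database each worker would likely need its own connection, or access to a shared one would need to be synchronized.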
