Increasing the number of threads
I'm trying to use Heritrix to crawl pages from one specific domain.
The crawl seems slow. One thing I noticed is that although there are 25 threads, 24 of them are always idle. Only one thread appears to be actively taking URIs from the queue and fetching data from the server.
Rates
0.33 URIs/sec (0.34 avg); 18 KB/sec (20 avg)
Load
1 active of 25 threads; 1 congestion ratio; 13193 deepest queue; 13193 average depth
Elapsed
1h32m3s424ms
Threads
25 threads: 24 ABOUT_TO_GET_URI, 1 ABOUT_TO_BEGIN_PROCESSOR; 24 noActiveProcessor, 1 fetchHttp
Frontier
RUN - 2 URI queues: 1 active (1 in-process; 0 ready; 0 snoozed); 0 inactive; 0 ineligible; 0 retired; 1 exhausted
Memory
79933 KiB used; 143508 KiB current heap; 253440 KiB max heap
Is there any configuration I can use to put all 25 threads to work? I've already found and changed the politeness-related settings (min/max delay). Thanks!
Found the answer on the mailing list: set parallelQueues on the queueAssignmentPolicy bean.
parallelQueues: default value (and historical behavior) is '1'. If
instead N, all URIs that previously went into the same single-named
queue will go into N related queues (via a consistent hash-mapping of
the path?query portion of the URL). Each queue is considered
separately for traditional politeness based on one-at-a-time
connections and snooze-delays-between-fetches -- so N queues means N
fetches could be in progress against a site at once. Thus, should only
be used in an overlay setting, applied to sites likely to handle
multiple connections well.
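The setting described above lives in Heritrix 3's crawler-beans.cxml. A minimal sketch, assuming the default SurtAuthorityQueueAssignmentPolicy; the sheet name and SURT prefix below are illustrative examples, not values from the question:

```xml
<!-- Simplest form: raise parallelQueues on the queue assignment policy.
     With N queues per site, up to N fetches can run against that site
     at once, so only do this for sites that tolerate parallel load. -->
<bean id="queueAssignmentPolicy"
      class="org.archive.crawler.frontier.SurtAuthorityQueueAssignmentPolicy">
  <property name="parallelQueues" value="5" />
</bean>

<!-- Overlay form, as the mailing-list answer recommends: apply the
     setting only to a chosen site via a sheet. "parallelSheet" and the
     SURT prefix are placeholder examples. -->
<bean id="parallelSheet" class="org.archive.spring.Sheet">
  <property name="map">
    <map>
      <entry key="queueAssignmentPolicy.parallelQueues" value="5" />
    </map>
  </property>
</bean>
<bean class="org.archive.crawler.spring.SurtPrefixesSheetAssociation">
  <property name="surtPrefixes">
    <list>
      <value>http://(com,example,</value>
    </list>
  </property>
  <property name="targetSheetNames">
    <list>
      <value>parallelSheet</value>
    </list>
  </property>
</bean>
```

After editing the configuration, rebuild and relaunch the job so the new bean settings take effect; the per-site queue count should then show N active queues for the targeted site instead of 1.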