Increasing the number of threads
I'm trying to use Heritrix to crawl pages from one specific domain.
The crawl seems slow. One thing I noticed is that although there are 25 threads, 24 of them are always idle. Only one thread appears to be actively taking URIs from the queue and fetching data from the server.
Rates
0.33 URIs/sec (0.34 avg); 18 KB/sec (20 avg)
Load
1 active of 25 threads; 1 congestion ratio; 13193 deepest queue; 13193 average depth
Elapsed
1h32m3s424ms
Threads
25 threads: 24 ABOUT_TO_GET_URI, 1 ABOUT_TO_BEGIN_PROCESSOR; 24 noActiveProcessor, 1 fetchHttp
Frontier
RUN - 2 URI queues: 1 active (1 in-process; 0 ready; 0 snoozed); 0 inactive; 0 ineligible; 0 retired; 1 exhausted
Memory
79933 KiB used; 143508 KiB current heap; 253440 KiB max heap
Is there any configuration I can use to put all 25 threads to work? I've already found and changed the politeness-related settings (min/max delay). Thanks!
Found the answer on the mailing list: set parallelQueues on the queueAssignmentPolicy bean.
parallelQueues: default value (and historical behavior) is '1'. If
instead N, all URIs that previously went into the same single-named
queue will go into N related queues (via a consistent hash-mapping of
the path?query portion of the URL). Each queue is considered
separately for traditional politeness based on one-at-a-time
connections and snooze-delays-between-fetches -- so N queues means N
fetches could be in progress against a site at once. Thus, should only
be used in an overlay setting, applied to sites likely to handle
multiple connections well.
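The setting described above lives in Heritrix 3's crawler-beans.cxml. A minimal sketch, assuming the default SurtAuthorityQueueAssignmentPolicy; the sheet name and SURT prefix below are illustrative examples, not values from the question:

```xml
<!-- Simplest form: raise parallelQueues on the queue assignment policy.
     With N queues per site, up to N fetches can run against that site
     at once, so only do this for sites that tolerate parallel load. -->
<bean id="queueAssignmentPolicy"
      class="org.archive.crawler.frontier.SurtAuthorityQueueAssignmentPolicy">
  <property name="parallelQueues" value="5" />
</bean>

<!-- Overlay form, as the mailing-list answer recommends: apply the
     setting only to a chosen site via a sheet. "parallelSheet" and the
     SURT prefix are placeholder examples. -->
<bean id="parallelSheet" class="org.archive.spring.Sheet">
  <property name="map">
    <map>
      <entry key="queueAssignmentPolicy.parallelQueues" value="5" />
    </map>
  </property>
</bean>
<bean class="org.archive.crawler.spring.SurtPrefixesSheetAssociation">
  <property name="surtPrefixes">
    <list>
      <value>http://(com,example,</value>
    </list>
  </property>
  <property name="targetSheetNames">
    <list>
      <value>parallelSheet</value>
    </list>
  </property>
</bean>
```

After editing the configuration, rebuild and relaunch the job so the new bean settings take effect; the per-site queue count should then show N active queues for the targeted site instead of 1.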