Relation between fetcher.server.min.delay and fetcher.threads.fetch in Nutch 1.13
I am running Nutch in local mode on a server with 64 GB RAM and 32 processors. I have one URL in the seed list and the following configuration:
fetcher.threads.fetch=16
fetcher.threads.per.queue=2
fetcher.max.crawl.delay=120
fetcher.queue.depth.multiplier=150
fetcher.queue.mode=byHost
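If -topN is set to 1000, how many requests will be made to the URL during the fetch phase?
Will multiple map tasks be created for the Fetcher? My understanding is that a single map task is created, regardless of the number of URLs that need to be fetched from the fetchlist.
I tried googling the relationship between fetcher.threads.fetch and fetcher.threads.per.queue, but did not find anything clear.
Here are also some logs from the fetcher phase: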
FetcherThread INFO fetcher.FetcherThread (277) - fetching http://investors.te.com/news-releases/press-release-details/2018/TE-Connectivity-announces-fourth-quarter-and-full-year-results-for-fiscal-year-2018/default.aspx (queue crawl delay=2000ms)
FetcherThread INFO fetcher.FetcherThread (277) - fetching http://investors.te.com/shareholder-info/default.aspx (queue crawl delay=2000ms)
FetcherThread INFO fetcher.FetcherThread (277) - fetching https://investors.te.com/news-releases/press-release-details/2019/TE-Connectivity-to-hold-annual-general-meeting-of-shareholders-on-March-13-2019/default.aspx (queue crawl delay=2000ms)
FetcherThread INFO fetcher.FetcherThread (277) - fetching https://investors.te.com/investor-resources/request-information/default.aspx (queue crawl delay=2000ms)
FetcherThread INFO fetcher.FetcherThread (277) - fetching https://investors.te.com/investor-resources/email-alerts/default.aspx (queue crawl delay=10000ms)
FetcherThread INFO fetcher.FetcherThread (277) - fetching https://investors.te.com/site-map/default.aspx (queue crawl delay=10000ms)
FetcherThread INFO fetcher.FetcherThread (277) - fetching https://investors.te.com/rss/PressRelease.aspx?LanguageId=1&CategoryWorkflowId=00000000-0000-0000-0000-000000000000&tags= (queue crawl delay=10000ms)
FetcherThread INFO fetcher.FetcherThread (277) - fetching https://investors.te.com/stock-information/quote-and-chart/default.aspx (queue crawl delay=10000ms)
FetcherThread INFO fetcher.FetcherThread (277) - fetching https://investors.te.com/investor-resources/overview/default.aspx (queue crawl delay=10000ms)
FetcherThread INFO fetcher.FetcherThread (277) - fetching https://investors.te.com/investor-resources/investor-contacts/default.aspx (queue crawl delay=10000ms)
FetcherThread INFO fetcher.FetcherThread (277) - fetching https://investors.te.com/js/mobileRedirect.js (queue crawl delay=10000ms)
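Only one request is made, because there is only one URL. If there are two URLs from a single host and fetcher.threads.per.queue=2, two requests can be sent to that host at the same time. A large fetcher.threads.fetch only makes sense if you are crawling a large number of hosts, or if you are crawling your own local, fast-responding web server. In the latter case, fetcher.threads.per.queue should be equal or close to fetcher.threads.fetch. If it is not your own server and you have not been explicitly allowed to do so, you should always keep the default value of fetcher.threads.per.queue, which is a single thread (=1): no parallel connections to the same host, and a guaranteed delay between successive requests.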
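
To make the interaction between the two settings concrete, here is a rough back-of-the-envelope sketch in Python. It is not Nutch code: the function name max_parallel_requests and the example host counts are made up for illustration, and the model assumes fetcher.queue.mode=byHost and ignores crawl delays entirely. It only estimates an upper bound on simultaneous requests.

```python
def max_parallel_requests(threads_fetch, threads_per_queue, urls_per_host):
    """Upper bound on simultaneous requests under a simplified model.

    Assumption (not taken from the Nutch source): one queue per host,
    and delays between requests are ignored.
    """
    # Each host queue can keep at most threads_per_queue requests in flight,
    # and never more than the number of URLs it holds.
    per_host = {
        host: min(threads_per_queue, count)
        for host, count in urls_per_host.items()
    }
    # The total is additionally capped by the size of the fetcher thread pool.
    total = min(threads_fetch, sum(per_host.values()))
    return total, per_host


if __name__ == "__main__":
    # One URL on one host: a single request, however many threads exist.
    print(max_parallel_requests(16, 2, {"investors.te.com": 1}))
    # Many URLs on one host: capped by fetcher.threads.per.queue.
    print(max_parallel_requests(16, 2, {"investors.te.com": 1000}))
    # Many hosts (hypothetical names): capped by fetcher.threads.fetch.
    hosts = {f"host{i}.example": 50 for i in range(20)}
    print(max_parallel_requests(16, 2, hosts)[0])
```

Under this model, the 16 fetcher threads from the question are mostly idle when every URL belongs to one host, which is why a large fetcher.threads.fetch only pays off across many hosts.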