StormCrawler: Timeout waiting for connection from pool

Whenever we increase the number of threads or executors for the Fetcher bolt, we keep getting the following error.

org.apache.http.conn.ConnectionPoolTimeoutException: Timeout waiting for connection from pool
at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.leaseConnection(PoolingHttpClientConnectionManager.java:286) ~[stormjar.jar:?]
at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.get(PoolingHttpClientConnectionManager.java:263) ~[stormjar.jar:?]
at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:190) ~[stormjar.jar:?]
at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:184) ~[stormjar.jar:?]
at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:184) ~[stormjar.jar:?]
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:71) ~[stormjar.jar:?]
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:220) ~[stormjar.jar:?]
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:164) ~[stormjar.jar:?]
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:139) ~[stormjar.jar:?]
at com.digitalpebble.stormcrawler.protocol.httpclient.HttpProtocol.getProtocolOutput(HttpProtocol.java:206) ~[stormjar.jar:?]

Is this caused by a resource leak, or by some hard limit on the size of the HTTP connection pool? If it is the pool, is there a way to increase its size?

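HttpProtocol sets the maximum number of connections in the pool to the number of fetcher threads (fetcher.threads.number). Because the pool is static, it is shared by all executors on the same worker, so with several FetcherBolt executors on one worker, more threads compete for connections than the pool can hand out. I'd recommend using one FetcherBolt instance per worker: the number of threads asking for connections then matches fetcher.threads.number and you should not run into this problem.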
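
To illustrate why a pool shared by several executors runs dry, here is a minimal simulation using only the JDK. It is a sketch, not StormCrawler or HttpClient code: a `java.util.concurrent.Semaphore` stands in for the static connection pool, and the class name `SharedPoolSketch`, the method `countTimeouts` and the 200 ms wait are all hypothetical. It also assumes the worst case, where every leased connection is held until all threads have asked for one; in a real crawl connections are released as fetches complete, so timeouts only appear under sustained contention.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class SharedPoolSketch {

    // Returns how many simulated fetch threads give up waiting for a connection.
    // poolSize plays the role of the pool's max connections (fetcher.threads.number).
    static int countTimeouts(int executors, int threadsPerExecutor, int poolSize) {
        int totalThreads = executors * threadsPerExecutor;
        Semaphore pool = new Semaphore(poolSize);           // stand-in for the static pool
        CountDownLatch allAttempted = new CountDownLatch(totalThreads);
        AtomicInteger timeouts = new AtomicInteger();
        Thread[] threads = new Thread[totalThreads];

        for (int i = 0; i < totalThreads; i++) {
            threads[i] = new Thread(() -> {
                boolean leased = false;
                try {
                    // Wait a bounded time for a free connection, like a lease timeout.
                    leased = pool.tryAcquire(200, TimeUnit.MILLISECONDS);
                    if (!leased) {
                        timeouts.incrementAndGet();
                    }
                    allAttempted.countDown();
                    if (leased) {
                        // Worst case: keep the connection busy until every thread has tried.
                        allAttempted.await();
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    if (leased) {
                        pool.release();
                    }
                }
            });
            threads[i].start();
        }

        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for threads", e);
            }
        }
        return timeouts.get();
    }

    public static void main(String[] args) {
        // One executor per worker: demand equals the pool size.
        System.out.println("1 executor x 4 threads, pool of 4: "
                + countTimeouts(1, 4, 4) + " timeouts");
        // Two executors sharing the same pool: demand is twice the pool size.
        System.out.println("2 executors x 4 threads, pool of 4: "
                + countTimeouts(2, 4, 4) + " timeouts");
    }
}
```

With a single executor the number of threads never exceeds the pool size, so nothing times out. With two executors on the same worker, twice as many threads ask for the same number of connections, and in this worst-case setup the surplus threads time out.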

Alternatively, you could give the okhttp protocol a try. It is more robust for open and large-scale crawls. See the WIKI page on protocols for a feature comparison.