Java/GPars - my thread pool seems to get "clogged"

Question

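What I'm doing: I'm going through a table of companies in a database... each company has a text description field, and that field can contain a number of hyperlinks (rarely more than 4). What I want to do is test these links with curl to catch "bad" responses (usually 404, but anything that isn't 200 is of interest).

By the way, this applies as much to Java as to Groovy, and people on either side may be interested to know that the underlying thread pool class used here by GPars (Groovy Parallelism) is ForkJoinPool.

Having collected these URLs using a Matcher with the Pattern /(https?:.*?)\)/, I get a map, descripURLs, of "url" --> "name of company". I then use withPool with a large capacity (obviously because of the intrinsic latency of waiting for responses), like so: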

startMillis = System.currentTimeMillis() 
AtomicInteger nRequest = new AtomicInteger()
AtomicInteger nResponsesReceived = new AtomicInteger()
poolObject = null
resultP = withPool( 50 ){ pool ->
    poolObject = pool
    descripURLs.eachParallel{ url, name ->
        int localNRequest = nRequest.incrementAndGet()
        Process process = checkURL( url )

        def response
        try {
            //// with the next line TIME PASSES in this Thread...
            response = process.text
        } catch( Exception e ) {
            System.err.println "$e"
        }
        // NB this line doesn't appear to make much difference
        process.destroyForcibly()
        int nResponses = nResponsesReceived.incrementAndGet()
        int nRequestsNowMade = nRequest.get()
        if( response?.trim() != '200' ) {
            println "\n*** request $localNRequest BAD RESPONSE\nname $name url $url\nresponse |$response|" +
                "\n$nRequestsNowMade made, outstanding ${nRequestsNowMade - nResponses}"
             // NB following line may of course not be printed immediately after the above line, due to parallelism
            println "\nprocess poolSize $pool.poolSize, queuedTaskCount $pool.queuedTaskCount," +
                " queuedSubmissionCount? $pool.queuedSubmissionCount"   
        }
        println "time now ${System.currentTimeMillis() - startMillis}, activeThreadCount $pool.activeThreadCount"
    }
    println "END OF withPool iterations"
    println "pool $pool class ${pool.class.simpleName}, activeThreadCount $pool.activeThreadCount"
    pool.shutdownNow()
}

println "resultP $resultP class ${resultP.class.simpleName}"
println "pool $poolObject class ${poolObject.class.simpleName}"
println "pool shutdown? $poolObject.shutdown"

def checkURL( url ) {
    def process =  "curl -LI $url -o /dev/null -w '%{http_code}\n' -s".execute()
    // this appears necessary... otherwise potentially you can have processes hanging around forever
    process.waitForOrKill( 8000 ) // 8 s to get a response
    process.addShutdownHook{
        println "shutdown on url $url"
    }
    process
}
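
For readers who want to reproduce the URL-collection step that precedes this listing, here is a minimal Java sketch of it. It assumes each link in a description is followed by a closing parenthesis, which is what the Pattern implies. The class name DescriptionUrls, the method collect and the sample text are hypothetical, not the code actually used.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DescriptionUrls {
    // Same expression as in the question: lazily capture from "http:" or "https:"
    // up to (but not including) the next closing parenthesis.
    private static final Pattern URL_IN_PARENS = Pattern.compile("(https?:.*?)\\)");

    // Adds one "url -> company name" entry to 'out' for every URL found in 'description'.
    static void collect(String companyName, String description, Map<String, String> out) {
        Matcher matcher = URL_IN_PARENS.matcher(description);
        while (matcher.find()) {
            out.put(matcher.group(1), companyName);
        }
    }

    public static void main(String[] args) {
        Map<String, String> descripURLs = new LinkedHashMap<>();
        // Hypothetical sample data
        collect("Acme", "See [site](https://acme.example/a) and [docs](http://acme.example/b).", descripURLs);
        descripURLs.forEach((url, name) -> System.out.println(url + " --> " + name));
    }
}
```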

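With the 50-thread pool above, I observe that 500 URLs take about 20 s to complete. I've experimented with smaller and larger pools: 100 seems to make no difference, but 25 seems slower, and 10 takes more like 40 s to complete. For a given pool size, the timings are also very consistent from run to run.

What I don't understand is that the Processes' shutdown hooks run only at the very end... for all 500 Processes! This is not to say that there are 500 actual processes on the machine: using Task Manager I can see that the number of curl.exe processes at any one time is relatively small.

At the same time, I can see from the println output that the active thread count starts at 50 but then declines throughout the run, reaching 3 (typically) by the end. And... I can also see that the final requests are only added when the run is nearly over.

This makes me wonder whether the thread pool is in some way being "clogged up" by "unfinished business" in these "zombie" Processes... I would expect the final requests (of the 500 made) to be issued well before the end of the run. Is there any way to shut these Processes down earlier?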

Answer 1

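Neither Java nor Groovy supports an addShutdownHook method that is specific to child Process instances.

The only addShutdownHook method Java supports is on the Runtime instance. It adds a hook that runs when the JVM shuts down.

Groovy adds a convenience addShutdownHook() to the Object class so that you don't have to write Runtime.getRuntime().addShutdownHook(..), but that doesn't change the underlying mechanism: these hooks are executed only on JVM shutdown.

Because the closures you add with process.addShutdownHook most likely retain a reference to the process instance, those instances will be kept alive until the JVM shuts down (the Java objects, that is, not the OS processes).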
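
As a small illustration of that point, the following Java sketch registers a hook and then checks that it is still pending after other work has finished. The class and method names are hypothetical. It relies only on Runtime.addShutdownHook and Runtime.removeShutdownHook, the latter returning true when the hook was still registered.

```java
class ShutdownHookDemo {
    // Registers a JVM shutdown hook, does some unrelated work, then reports
    // whether the hook was still waiting (i.e. it had not run in the meantime).
    static boolean hookStillPendingAfterWork() {
        Thread hook = new Thread(() -> System.out.println("JVM is shutting down"));
        Runtime.getRuntime().addShutdownHook(hook);

        // Stand-in for "the process has long since finished": nothing here triggers the hook.
        long sum = 0;
        for (int i = 0; i < 1_000; i++) {
            sum += i;
        }

        // removeShutdownHook returns true only if the hook was still registered.
        return Runtime.getRuntime().removeShutdownHook(hook);
    }

    public static void main(String[] args) {
        System.out.println("hook still pending: " + hookStillPendingAfterWork());
    }
}
```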
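
If what you actually want is a callback when each individual process finishes, one option (a suggestion beyond the mechanism described above, and assuming Java 9 or later) is Process.onExit(), which returns a CompletableFuture that completes when that process terminates. The sketch below is illustrative only: the class name, the method runAndNotify and the counter are hypothetical, and it launches the current JVM's own java binary rather than curl so that it runs anywhere.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicInteger;

class ProcessExitCallbackDemo {
    // Counts how many per-process callbacks have run so far.
    static final AtomicInteger completed = new AtomicInteger();

    // Starts the command, discards its output, and returns a future that
    // completes with the exit code as soon as that one process terminates.
    static CompletableFuture<Integer> runAndNotify(String... command) {
        Process process;
        try {
            process = new ProcessBuilder(command)
                    .redirectErrorStream(true)                       // merge stderr into stdout
                    .redirectOutput(ProcessBuilder.Redirect.DISCARD) // nothing to read, nothing to block on
                    .start();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // This callback is tied to the process, not to JVM shutdown.
        return process.onExit().thenApply(p -> {
            completed.incrementAndGet();
            return p.exitValue();
        });
    }

    public static void main(String[] args) {
        // Assumption: fall back to "java" on the PATH if the current command is unknown.
        String javaBin = ProcessHandle.current().info().command().orElse("java");
        int exitCode = runAndNotify(javaBin, "-version").join();
        System.out.println("exit code " + exitCode + ", callbacks run: " + completed.get());
    }
}
```

In Groovy the same call should be reachable as process.onExit() on the object returned by execute(), since that object is a java.lang.Process.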
