crawler4j: asynchronously saving results to a file

Question

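I'm evaluating crawler4j for roughly 1 million crawls per day. My scenario is this: I fetch a URL and parse its description, keywords and title. Now I would like to save each URL and its text to a file.

I have seen that it is possible to save crawled data to files. However, since I have many crawls to perform, I want a different thread to do the file-saving work on the file system, so that the fetcher threads are not blocked. Is this possible with crawler4j? If so, how?

Thanks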

Answer 1

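Consider using a Queue (a BlockingQueue or similar) to hold the data to be written; one or more worker threads then take items from it and write them out. This approach is not specific to crawler4j. Search for "producer consumer" to get the general idea.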
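To make the producer-consumer idea concrete, here is a minimal sketch that uses only the JDK. The names `AsyncFileWriter`, `Data` and `STOP`, the field layout and the tab-separated output format are all assumptions made for illustration; none of them come from crawler4j. Crawler threads would call `queue.put(...)`, and a single writer thread drains the queue.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/** Consumer: takes records off a queue and appends them to one file. */
public class AsyncFileWriter implements Runnable {

    /** Hypothetical record type: one crawled URL plus its parsed fields. */
    public static final class Data {
        final String url, title, description, keywords;

        public Data(String url, String title, String description, String keywords) {
            this.url = url;
            this.title = title;
            this.description = description;
            this.keywords = keywords;
        }
    }

    /** Sentinel ("poison pill") that tells the writer thread to stop. */
    public static final Data STOP = new Data(null, null, null, null);

    private final BlockingQueue<Data> queue;
    private final Path out;

    public AsyncFileWriter(BlockingQueue<Data> queue, Path out) {
        this.queue = queue;
        this.out = out;
    }

    @Override
    public void run() {
        try (BufferedWriter w = Files.newBufferedWriter(out, StandardCharsets.UTF_8)) {
            while (true) {
                Data d = queue.take();          // blocks until a record is available
                if (d == STOP) {
                    break;                      // compared by identity, not by content
                }
                // Assumed format: one tab-separated line per URL (tabs in fields are not escaped).
                w.write(d.url + "\t" + d.title + "\t" + d.description + "\t" + d.keywords);
                w.newLine();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt status and exit
        }
    }

    public static void main(String[] args) throws Exception {
        // A bounded queue makes producers wait if the disk cannot keep up.
        BlockingQueue<Data> queue = new LinkedBlockingQueue<>(10_000);
        Path out = Files.createTempFile("crawl", ".tsv");

        Thread writer = new Thread(new AsyncFileWriter(queue, out), "file-writer");
        writer.start();

        // In a crawler, this put() would happen inside the page-visit callback.
        queue.put(new Data("http://example.com/", "Example", "A description", "a,b"));
        queue.put(STOP);
        writer.join();

        System.out.println(Files.readAllLines(out));
    }
}
```

With a bounded queue, `put` blocks when the queue is full, so a slow disk eventually slows the crawler threads down instead of exhausting memory. If dropping records is preferable to blocking, `offer` with a timeout is an alternative.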

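Regarding the follow-up question of how to pass the Queue to the crawler instances, this should do the trick (this is only from looking at the source code; I have not used crawler4j myself):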

final BlockingQueue<Data> queue = …

// use a factory instead of supplying the crawler class, so the queue can be passed in
controller.start(new WebCrawlerFactory<MyCrawler>() {
    @Override
    public MyCrawler newInstance() throws Exception {
        return new MyCrawler(queue);
    }
}, numberOfCrawlers);
