Apparent memory leak in recursive ruby method

Question

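As soon as I run this script, I see CPU usage and disk IO on the server rise steadily until the process finally gets killed.

It is a script that recursively scrapes a site by picking an unscraped URL from the database, scraping it, and adding the links it finds to the database.

I assume there is some kind of memory leak inside the function, or in the way the function interacts with ActiveRecord. Is there any way to make it more efficient and plug the leak?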

def self.site project, operate

  @log = Logger.new(STDOUT)

  recurse = ->() do
    #
    # Pick a page from the database to crawl
    unless ProjectData.where( status: 'unscraped', project_id: project[:id] ).exists?
      @log.info "No pages to scrape"
      return
    end  

    working_page = ProjectData.where( status: 'unscraped', project_id: project[:id]).first
    working_page.status = 'processing'
    working_page.save

    @log.info "Scraping #{working_page.url}"
    #
    #   Scrape it
    data, links = OutriderTools::Scrape::page( working_page.url, operate)

    unless links.nil? 
      links.each  do |link|
        # Check if link already exists
        #if ProjectData.find_by(url: link.to_s).nil?
        unless ProjectData.where( url: link.to_s, project_id: project[:id] ).exists?  
          ProjectData.create({
            :url        => link.to_s,
            :status     => 'unscraped',
            :project_id => project[:id]
          })
          @log.info "Adding new url to database: #{link.to_s}"
        else
          @log.info "URL already exists in database: #{link.to_s}"
        end
      end
    end

    @log.info "Saving page data for url #{working_page.url}"
    @log.info data[:status]
    working_page.update( data ) unless data.nil?

    recurse.call

  end

  recurse.call

end

Answer 1

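You should make sure to set the status to something other than unscraped when you are done with the page. To me it is unclear what working_page.update( data ) unless data.nil? does. I also don't see the point of using recursion. You could use an infinite loop and break when there are no more pages. With recursion, every page adds another call frame that is only released once the whole crawl finishes, so memory can fill up quite fast. Most scripts like this are slow and may cause timeouts when executed by a web server. You should run the script as some kind of scheduled job.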
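
A minimal sketch of the loop-and-break approach described above. To keep it runnable without Rails, an in-memory array of hashes stands in for the ProjectData table, and fetch_links is a hypothetical stub in place of OutriderTools::Scrape::page; neither comes from the original script.

```ruby
# In-memory stand-in for the ProjectData table (assumption, not ActiveRecord).
PAGES = [
  { url: 'http://example.com/', status: 'unscraped' }
]

# Hypothetical site map used by the stub scraper below.
FAKE_SITE = {
  'http://example.com/'  => ['http://example.com/a', 'http://example.com/b'],
  'http://example.com/a' => ['http://example.com/'],
  'http://example.com/b' => []
}

# Hypothetical stub: returns the links "found" on a page.
def fetch_links(url)
  FAKE_SITE.fetch(url, [])
end

def crawl(pages)
  loop do
    page = pages.find { |p| p[:status] == 'unscraped' }
    break if page.nil?            # nothing left to scrape: leave the loop

    page[:status] = 'processing'
    fetch_links(page[:url]).each do |link|
      next if pages.any? { |p| p[:url] == link }   # skip known URLs
      pages << { url: link, status: 'unscraped' }
    end
    page[:status] = 'scraped'     # mark as done so it is never picked again
  end
  pages
end

crawl(PAGES)
```

Each pass through the loop reuses the same local variables, so the previous page and its links become eligible for garbage collection as soon as the next pass starts.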

Answer 2

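First let me point you to this article about memory leaks that I read recently; it was part of the splendid Ruby Weekly newsletter.

That said, it is mostly advanced stuff, and most of the time the more traditional, simple methods get you there faster.

In my opinion the most likely root of the problem is the recursion, so get rid of it.

Some parts of your code could also be leaner. For example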

working_page = ProjectData.where( status: 'unscraped', project_id: project[:id]).first
working_page.status = 'processing'
working_page.save

could be

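# Caveat: first_or_create applies status: 'processing' only when it creates a row; an existing row is returned unchanged
working_page = ProjectData.where( status: 'unscraped', project_id: project[:id]).first_or_create(status: 'processing')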

The same trick applies here:

unless ProjectData.where( url: link.to_s, project_id: project[:id] ).exists?
  ProjectData.create({
    :url        => link.to_s,
    :status     => 'unscraped',
    :project_id => project[:id]
  })
end

could be (and don't mix old and new hash notation)

# Look up by url and project only, so pages that were already scraped are not added again
attrs = { url: link.to_s, project_id: project[:id] }
ProjectData.where(attrs).first_or_create(status: 'unscraped')

You can get rid of the last extra level of nesting by using

return if links.nil?

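Your best bet is to comment out everything that is not absolutely necessary, such as the logging and even the saving to the database. Start with just a few lines, confirm that they run without memory growing, and then build the script back up by removing the comments one step at a time.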
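
One way to check whether a step runs without memory growing is to compare the number of live objects before and after it. The sketch below uses Ruby's built-in GC.stat; the helper name live_slots_delta is made up for illustration, and the figures are only approximate because the interpreter allocates objects of its own.

```ruby
# Hypothetical helper: approximate growth in live heap objects across a block.
def live_slots_delta
  GC.start
  before = GC.stat(:heap_live_slots)
  yield
  GC.start
  GC.stat(:heap_live_slots) - before
end

retained = []

# Objects that stay reachable after the block show up as growth.
growth_when_retained = live_slots_delta do
  100_000.times { |i| retained << "http://example.com/page/#{i}" }
end

# Objects that become garbage are reclaimed, so growth stays small.
growth_when_dropped = live_slots_delta do
  100_000.times { |i| "http://example.com/page/#{i}" }
end

puts "retained: #{growth_when_retained}, dropped: #{growth_when_dropped}"
```

Wrapping one re-enabled step at a time in such a block gives a rough signal of which step is holding on to objects.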

Answer 3

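Just a thought, not an answer:

I hope you are aware that by using recursion you keep all of the collected data and variables in memory; they are never released until the recursion ends.

For example, the working_page and links variables both stay alive in memory (together with the ActiveRecord objects they reference) while new working_page and links variables are created inside the next level of the recursion.

There may be no memory leak at all, just a design issue.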
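
To illustrate the point about frames being kept alive: standard Ruby (MRI) does not eliminate tail calls by default, so a lambda that calls itself adds a frame per step, while a loop reuses a single one. The sketch below swaps the database for a plain counter; the method names are illustrative only.

```ruby
# Recursive version: every call adds a frame, and the locals in each frame
# stay reachable until the whole chain unwinds.
def count_down_recursively(n)
  recurse = lambda do |remaining|
    return 0 if remaining.zero?
    page = "page #{remaining}"    # stand-in for working_page, held by this frame
    recurse.call(remaining - 1)
  end
  recurse.call(n)
  :finished
rescue SystemStackError
  :stack_exhausted
end

# Iterative version: one frame, and `page` is overwritten on every pass,
# so the previous value can be garbage collected.
def count_down_with_loop(n)
  remaining = n
  while remaining > 0
    page = "page #{remaining}"
    remaining -= 1
  end
  :finished
end

puts count_down_recursively(1_000_000)
puts count_down_with_loop(1_000_000)
```

With a default MRI stack size the recursive call is expected to report :stack_exhausted at this depth, whereas the loop completes; in the original script each frame also holds a record and its links, so memory pressure shows up long before the stack limit.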

Unless you need that data again after the recursion, and it seems you don't, it is better to use a while loop:

working_page = nil
while (working_page = ProjectData.where( status: 'unscraped', project_id: project[:id] ).first)
   # ... do your thing...
end

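(The = is not a mistake. It is an assignment, and the value of the whole assignment is then evaluated as the loop condition: the loop continues as long as a record was found and assigned to working_page.)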
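
The assignment-as-condition idiom can be tried in isolation. In this small sketch a plain array plays the role of the unscraped rows (an assumption for illustration); Array#shift returns nil once the array is empty, just as .first returns nil when no row matches.

```ruby
queue   = ['http://example.com/a', 'http://example.com/b']
visited = []

# The assignment evaluates to the assigned value, so a nil from shift
# ends the loop. The parentheses are a common way to signal that the
# single = is intentional.
while (url = queue.shift)
  visited << url
end

puts visited.inspect
```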

ruby

recursion

memory-leaks

ruby-on-rails