Nokogiri Scraping In Rails

So I have this code in my index action and would really like to move it into a model; I'm just a bit confused about how to do it.

Original code

  def index
    urls = %w[http://cltampa.com/blogs/potlikker http://cltampa.com/blogs/artbreaker http://cltampa.com/blogs/politicalanimals http://cltampa.com/blogs/earbuds http://cltampa.com/blogs/dailyloaf http://cltampa.com/blogs/bedpost]
    @final_images = []
    @final_urls = []
    
    urls.each do |url|
      blog = Nokogiri::HTML(open(url)) 
      images = blog.xpath('//*[@class="postBody"]/div[1]//img/@src')
      images.each do |image|
        @final_images << image
      end
      
      story_path = blog.xpath('//*[@class="postTitle"]/a/@href')
      story_path.each do |path|
        @final_urls << path
      end
    end  
  end

I tested this code in my model and it works fine for a single URL; I'm just not sure how to integrate all of the URLs the way the original code does.

New code

Model

class Photocloud < ActiveRecord::Base

  attr_reader :url, :data

  def initialize(url)
    @url = url
  end

  def data
    @data ||= Nokogiri::HTML(open(url))
  end

  def get_elements(path)
    data.xpath(path)
  end

end

Controller

def index 
  @scraper = Photocloud.new('http://cltampa.com/blogs/artbreaker')
  @photos = @scraper.get_elements('//*[@class="postBody"]/div[1]//img/@src')
  @story_urls = @scraper.get_elements('//*[@class="postBody"]/div[1]//img/@src')
end

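My main question is how to initialize multiple URLs and loop over them as my original code does. I've tried different things but feel like I've hit a wall. I'll need to save them to the database eventually, but I want to get this working first. Any help is much appreciated.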

Updated controller - WIP

  def index
    start_urls = %w[http://cltampa.com/blogs/potlikker 
      http://cltampa.com/blogs/artbreaker 
      http://cltampa.com/blogs/politicalanimals 
      http://cltampa.com/blogs/earbuds 
      http://cltampa.com/blogs/dailyloaf 
      http://cltampa.com/blogs/bedpost]
    @scraper = Photocloud.new(start_urls)
    @images = 
    @paths = 
  end

Need some help with this part...

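It looks like you aren't persisting the scraped images and paths to the database, so Photocloud doesn't need to inherit from ActiveRecord::Base - it can just be a plain old Ruby object (PORO):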

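require 'nokogiri'
require 'open-uri'

class Photocloud
  attr_reader :start_urls
  attr_accessor :images, :paths

  def initialize(start_urls)
    @start_urls = start_urls
    @images = []
    @paths = []
  end

  # Fetch every start URL and collect its image sources and story links.
  def scrape
    start_urls.each do |start_url|
      # URI.open comes from open-uri; a bare open(url) no longer fetches URLs on Ruby 3.0+.
      blog = Nokogiri::HTML(URI.open(start_url))
      scrape_images(blog)
      scrape_paths(blog)
    end
  end

  private

  def scrape_images(blog)
    # Named found_images so the local variable does not shadow the images accessor.
    found_images = blog.xpath('//*[@class="postBody"]/div[1]//img/@src')
    found_images.each do |image|
      images << image
    end
  end

  def scrape_paths(blog)
    story_paths = blog.xpath('//*[@class="postTitle"]/a/@href')
    story_paths.each do |path|
      paths << path
    end
  end
end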

In the controller:

scraper = Photocloud.new(start_urls)
scraper.scrape
@images = scraper.images
@paths = scraper.paths

Of course, this is just one possible way to structure the code.