Nokogiri Scraping in Rails
I have this code in my index action and would like to move it into a model, but I'm a little confused about how to do that.
Original code
def index
  urls = %w[http://cltampa.com/blogs/potlikker http://cltampa.com/blogs/artbreaker http://cltampa.com/blogs/politicalanimals http://cltampa.com/blogs/earbuds http://cltampa.com/blogs/dailyloaf http://cltampa.com/blogs/bedpost]
  @final_images = []
  @final_urls = []
  urls.each do |url|
    blog = Nokogiri::HTML(open(url))
    images = blog.xpath('//*[@class="postBody"]/div[1]//img/@src')
    images.each do |image|
      @final_images << image
    end
    story_path = blog.xpath('//*[@class="postTitle"]/a/@href')
    story_path.each do |path|
      @final_urls << path
    end
  end
end
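I tested the code below in my model and it works well for a single URL, but I'm not sure how to handle all of the URLs the way the original code does.
New code
Model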
class Photocloud < ActiveRecord::Base
  attr_reader :url, :data

  def initialize(url)
    @url = url
  end

  def data
    @data ||= Nokogiri::HTML(open(url))
  end

  def get_elements(path)
    data.xpath(path)
  end
end
Controller
def index
  @scraper = Photocloud.new('http://cltampa.com/blogs/artbreaker')
  @photos = @scraper.get_elements('//*[@class="postBody"]/div[1]//img/@src')
  # Story links use the postTitle XPath, as in the original index action
  @story_urls = @scraper.get_elements('//*[@class="postTitle"]/a/@href')
end
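My main question is how to initialize the scraper with multiple URLs and loop over them the way my original code does. I've tried a few different things but feel like I've hit a wall. I'll eventually need to save the results to the database, but I'd like to get this working first. Any help is much appreciated.
Updated controller - WIP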
def index
  start_urls = %w[http://cltampa.com/blogs/potlikker
                  http://cltampa.com/blogs/artbreaker
                  http://cltampa.com/blogs/politicalanimals
                  http://cltampa.com/blogs/earbuds
                  http://cltampa.com/blogs/dailyloaf
                  http://cltampa.com/blogs/bedpost]
  @scraper = Photocloud.new(start_urls)
  @images =
  @paths =
end
I need some help with this part...
You don't seem to be persisting the scraped images and paths to the database, so Photocloud doesn't need to inherit from ActiveRecord::Base; it can be a plain old Ruby object (PORO):
require 'nokogiri'
require 'open-uri'

class Photocloud
  attr_reader :start_urls
  attr_accessor :images, :paths

  def initialize(start_urls)
    @start_urls = start_urls
    @images = []
    @paths = []
  end

  def scrape
    start_urls.each do |start_url|
      # URI.open comes from open-uri; on Ruby 3.0+ a bare open(url) no longer fetches URLs
      blog = Nokogiri::HTML(URI.open(start_url))
      scrape_images(blog)
      scrape_paths(blog)
    end
  end

  private

  def scrape_images(blog)
    # The local is named `found` so it does not shadow the `images` accessor
    found = blog.xpath('//*[@class="postBody"]/div[1]//img/@src')
    found.each do |image|
      images << image
    end
  end

  def scrape_paths(blog)
    story_path = blog.xpath('//*[@class="postTitle"]/a/@href')
    story_path.each do |path|
      paths << path
    end
  end
end
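One pitfall to watch for in helper methods like scrape_images: if a local variable gets the same name as an accessor, every later use of that name inside the method refers to the local, and the instance's array is never touched. The sketch below shows the effect with a hypothetical Collector class (the class and its method names are made up for illustration and are not part of Photocloud).

```ruby
# Hypothetical class used only to illustrate local-variable shadowing.
class Collector
  attr_reader :images

  def initialize
    @images = []
  end

  # Shadowed: the assignment creates a local named `images`,
  # so the append below never reaches @images.
  def add_shadowed(found)
    images = found.dup
    images << 'extra.jpg'
  end

  # No assignment to `images` here, so the name resolves to the reader
  # method and the items end up in @images.
  def add_correctly(found)
    found.each { |item| images << item }
  end
end

collector = Collector.new
collector.add_shadowed(['a.jpg'])
after_shadowed = collector.images.dup   # still empty

collector.add_correctly(['a.jpg'])
after_correct = collector.images.dup    # now holds the item
```

Giving the local a different name, or writing to @images directly, avoids the problem.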
In the controller:
scraper = Photocloud.new(start_urls)
scraper.scrape
@images = scraper.images
@paths = scraper.paths
Of course, this is just one possible way to structure the code.