Mechanize: dealing with errors
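As you'll see from my series of questions, I've built a small Mechanize task that visits a page (), finds the links to the cafes, and saves each cafe's details to a CSV file.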
task :estimateone => :environment do
  require 'mechanize'
  require 'csv'

  mechanize = Mechanize.new
  mechanize.history_added = Proc.new { sleep 30.0 }
  mechanize.ignore_bad_chunking = true
  mechanize.follow_meta_refresh = true
  page = mechanize.get('http://www.siteexamplea.com/city/list/50-city-cafes-you-should-have-eaten-breakfast-at')

  results = []
  results << ['name', 'streetAddress', 'addressLocality', 'postalCode', 'addressRegion', 'addressCountry', 'telephone', 'url']

  page.css('ol li a').each do |link|
    mechanize.click(link)
    name = mechanize.page.css('article h1[itemprop="name"]').text.strip
    streetAddress = mechanize.page.css('address span span[itemprop="streetAddress"]').text.strip
    addressLocality = mechanize.page.css('address span span[itemprop="addressLocality"]').text.strip
    postalCode = mechanize.page.css('address span span[itemprop="postalCode"]').text.strip
    addressRegion = mechanize.page.css('address span span[itemprop="addressRegion"]').text.strip
    addressCountry = mechanize.page.css('address span meta[itemprop="addressCountry"]').text.strip
    telephone = mechanize.page.css('address span[itemprop="telephone"]').text.strip
    url = mechanize.page.css('article p a[itemprop="url"]').text.strip
    tags = mechanize.page.css('article h1[itemprop="name"]').text.strip
    results << [name, streetAddress, addressLocality, postalCode, addressRegion, addressCountry, telephone, url]
  end

  CSV.open("filename.csv", "w+") do |csv_file|
    results.each do |row|
      csv_file << row
    end
  end
end
When I get to the tenth link, I hit a 503 error.
Mechanize::ResponseCodeError: 503 => Net::HTTPServiceUnavailable for https://www.city.com/city/directory/morning-after -- unhandled response
I've tried a few things to stop this from happening, or to rescue from this state, but I can't work it out. Any suggestions?
You want to rescue when the request fails, just like here:
task :estimateone => :environment do
  require 'mechanize'
  require 'csv'

  # ...
  page = mechanize.get('http://www.theurbanlist.com/brisbane/a-list/50-brisbane-cafes-you-should-have-eaten-breakfast-at')

  page.css('ol li a').each do |link|
    begin
      # The 503 in the question is raised here, while following a cafe link.
      mechanize.click(link)
    rescue Mechanize::ResponseCodeError
      # do something with the result, log it, write it, mark it as failed, wait a bit and then continue the job
      next
    end
    # ...
  end
end
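My guess is that you've hit a rate limit. Rescuing won't fix the underlying problem, because the cause isn't on your side but on the server's; it does give you room to work, though, since you can now flag the links that failed and carry on from there.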
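If the 503s turn out to be temporary, one option is to retry each link a few times with a growing pause before marking it as failed. Here is a minimal sketch of that idea. The helper names `with_retries` and `scrape_links`, their parameters, and the default delays are all made up for illustration; none of this is part of Mechanize. The error classes are passed in so the helpers don't depend on the gem.

```ruby
# Hypothetical helper (not part of Mechanize): runs the block and retries it
# when it raises one of `errors`. It waits base_delay, then 2 * base_delay,
# then 4 * base_delay, ... between attempts, and re-raises the error once
# `attempts` tries have been used up.
# `sleeper` is injectable so the pause can be replaced, e.g. in tests.
def with_retries(errors:, attempts: 3, base_delay: 30.0, sleeper: ->(seconds) { sleep(seconds) })
  tries = 0
  begin
    tries += 1
    yield
  rescue *errors
    raise if tries >= attempts
    sleeper.call(base_delay * (2**(tries - 1)))
    retry
  end
end

# Hypothetical helper: calls the block once per link (with retries), keeps the
# rows that succeed and records the links that still fail.
# Returns [rows, failed], where failed holds [link, error message] pairs.
def scrape_links(links, errors:, **retry_options)
  rows = []
  failed = []
  links.each do |link|
    begin
      rows << with_retries(errors: errors, **retry_options) { yield link }
    rescue *errors => e
      failed << [link, e.message]
    end
  end
  [rows, failed]
end

# Possible use inside the task from the question (assumes the `mechanize`
# agent and `page` set up there):
#
#   rows, failed = scrape_links(page.css('ol li a'), errors: [Mechanize::ResponseCodeError]) do |link|
#     mechanize.click(link)
#     [mechanize.page.css('article h1[itemprop="name"]').text.strip]
#   end
```

`failed` can then be written to a second CSV or logged, so the links that never loaded can be re-run later without repeating the whole job.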