Mechanize won't connect to site

Question

Hello, I have a problem: the mechanize gem won't connect to a site. The gem is installed. Code:

require 'mechanize'

agent = Mechanize.new
main_page = agent.get 'https://imbd.com'
list_page = main_page.link_with(text: "Top 250").click
rows = list_page.root.css(".lister-list tr")

puts rows.size

Here is the error:

C:/Ruby/lib/ruby/2.2.0/net/http.rb:879:in `initialize': A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond. - connect(2) for "imbd.com" port 80 (Errno::ETIMEDOUT)
    from C:/Ruby/lib/ruby/2.2.0/net/http.rb:879:in `open'
    from C:/Ruby/lib/ruby/2.2.0/net/http.rb:879:in `block in connect'
    from C:/Ruby/lib/ruby/2.2.0/timeout.rb:73:in `timeout'
    from C:/Ruby/lib/ruby/2.2.0/net/http.rb:878:in `connect'
    from C:/Ruby/lib/ruby/2.2.0/net/http.rb:863:in `do_start'
    from C:/Ruby/lib/ruby/2.2.0/net/http.rb:858:in `start'
    from C:/Ruby/lib/ruby/gems/2.2.0/gems/net-http-persistent-2.9.4/lib/net/http/persistent.rb:700:in `start'
    from C:/Ruby/lib/ruby/gems/2.2.0/gems/net-http-persistent-2.9.4/lib/net/http/persistent.rb:631:in `connection_for'
    from C:/Ruby/lib/ruby/gems/2.2.0/gems/net-http-persistent-2.9.4/lib/net/http/persistent.rb:994:in `request'
    from C:/Ruby/lib/ruby/gems/2.2.0/gems/mechanize-2.7.4/lib/mechanize/http/agent.rb:267:in `fetch'
    from C:/Ruby/lib/ruby/gems/2.2.0/gems/mechanize-2.7.4/lib/mechanize.rb:464:in `get'
    from C:/Ruby/Workspace/imbd.rb:4:in `<main>'

Does anyone know what's wrong? Thanks!

Answer 1

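After looking at IMDb, I noticed that it runs a lot of JavaScript, which will break Mechanize because it cannot parse JS or make sense of the incoming response. If you want to scrape content or automate browsing, I'd suggest using Capybara instead of Mechanize. Capybara combined with Poltergeist (which requires installing PhantomJS) will work better than Mechanize and is built for automated interaction with JS-heavy pages.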

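I've added an approach that might resolve the error for you. If it works, it's because Mechanize was trying to fetch the page before the JS had finished running, and so wasn't getting valid data.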

Edit:

  require 'mechanize'

  agent = Mechanize.new
  agent.read_timeout = 3  # seconds to wait for a response before giving up
  begin
    main_page = agent.get 'https://imbd.com'
    # click returns the page the link points to
    list_page = main_page.link_with(text: "Top 250").click
    rows = list_page.root.css(".lister-list tr")
  rescue Timeout::Error
    puts "Timeout!"
    puts "read_timeout attribute is set to #{agent.read_timeout}s" unless agent.read_timeout.nil?
  end
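
One caveat about the rescue clause above: the error in the question is Errno::ETIMEDOUT, which is a SystemCallError and not a Timeout::Error, so that clause would not catch it. The sketch below shows one way to tell the failure modes apart. The helper name diagnose_connection and the symbols it returns are made up for illustration and are not part of Mechanize; it only relies on exception classes from Ruby's standard library.

```ruby
require 'socket'   # defines SocketError
require 'timeout'  # defines Timeout::Error

# Hypothetical helper: runs the block and reports which kind of
# connection failure occurred, if any.
def diagnose_connection
  yield
  :ok
rescue Errno::ETIMEDOUT, Errno::ECONNREFUSED, Errno::EHOSTUNREACH
  :unreachable   # OS-level connect failure, as in the question's trace
rescue SocketError
  :dns_failure   # typically raised when the hostname cannot be resolved
rescue Timeout::Error
  :timeout       # read/open timeouts raised as Timeout::Error subclasses
end
```

Assuming mechanize lets these exceptions propagate unchanged (as the stack trace in the question suggests for Errno::ETIMEDOUT), the request could be wrapped as `diagnose_connection { agent.get(url) }`.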

Answer 2

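While it's true that mechanize doesn't support JavaScript, your problem is that you're trying to reach a site that doesn't exist. You're trying to access www.imbd.com instead of www.imdb.com. So the error message is accurate.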
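
A hostname typo like this can be caught before any request is made. The following is a minimal sketch using only Ruby's standard uri library; the helper name expected_host? is hypothetical and not something mechanize provides.

```ruby
require 'uri'

# Hypothetical helper: true when the URL's host, ignoring a leading
# "www.", equals the expected host. Returns false if the URL has no host.
def expected_host?(url, expected_host)
  host = URI.parse(url).host
  return false if host.nil?

  normalize = ->(h) { h.downcase.sub(/\Awww\./, '') }
  normalize.call(host) == normalize.call(expected_host)
end

puts expected_host?('https://imbd.com', 'imdb.com')      # false: the typo from the question
puts expected_host?('https://www.imdb.com', 'imdb.com')  # true
```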

And FWIW, IMDb doesn't want you scraping their site:

Robots and Screen Scraping: You may not use data mining, robots, screen scraping, or similar data gathering and extraction tools on this site, except with our express written consent as noted below.
