Mechanize won't connect to site

Hello, I've run into a problem: the mechanize gem can't connect to a site. The gem is installed. Code:

require 'mechanize'

agent = Mechanize.new
main_page = agent.get 'https://imbd.com'
list_page = main_page.link_with(text: "Top 250").click
rows = list_page.root.css(".lister-list tr")

puts rows.size

This is the error:

C:/Ruby/lib/ruby/2.2.0/net/http.rb:879:in `initialize': A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond. - connect(2) for "imbd.com" port 80 (Errno::ETIMEDOUT)
    from C:/Ruby/lib/ruby/2.2.0/net/http.rb:879:in `open'
    from C:/Ruby/lib/ruby/2.2.0/net/http.rb:879:in `block in connect'
    from C:/Ruby/lib/ruby/2.2.0/timeout.rb:73:in `timeout'
    from C:/Ruby/lib/ruby/2.2.0/net/http.rb:878:in `connect'
    from C:/Ruby/lib/ruby/2.2.0/net/http.rb:863:in `do_start'
    from C:/Ruby/lib/ruby/2.2.0/net/http.rb:858:in `start'
    from C:/Ruby/lib/ruby/gems/2.2.0/gems/net-http-persistent-2.9.4/lib/net/http/persistent.rb:700:in `start'
    from C:/Ruby/lib/ruby/gems/2.2.0/gems/net-http-persistent-2.9.4/lib/net/http/persistent.rb:631:in `connection_for'
    from C:/Ruby/lib/ruby/gems/2.2.0/gems/net-http-persistent-2.9.4/lib/net/http/persistent.rb:994:in `request'
    from C:/Ruby/lib/ruby/gems/2.2.0/gems/mechanize-2.7.4/lib/mechanize/http/agent.rb:267:in `fetch'
    from C:/Ruby/lib/ruby/gems/2.2.0/gems/mechanize-2.7.4/lib/mechanize.rb:464:in `get'
    from C:/Ruby/Workspace/imbd.rb:4:in `<main>'

Does anyone know what's wrong? Thanks!

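After looking at IMDb, I see that they run a lot of JavaScript, which will make Mechanize fail because it can't execute JS and so can't make sense of the incoming response. If you want to scrape content or automate browsing, I suggest using Capybara instead of Mechanize. Capybara combined with Poltergeist (you will need to install PhantomJS for this approach) will work better than Mechanize and is built for automated interaction with JS-heavy pages.

I've added an approach that might resolve the error for you. If it works, that's because Mechanize was trying to fetch the page before the JS scripts had finished, and so wasn't getting valid data.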

Edit:

  require 'mechanize'

  agent = Mechanize.new
  agent.read_timeout = 3  # seconds to wait for a response before giving up
  begin
    main_page = agent.get 'https://imbd.com'
    # click returns the page the link leads to, so keep a reference to it
    list_page = main_page.link_with(text: "Top 250").click
    rows = list_page.root.css(".lister-list tr")
  rescue Timeout::Error
    puts "Timeout!"
    puts "read_timeout attribute is set to #{agent.read_timeout}s" unless agent.read_timeout.nil?
  end
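
One caveat about the rescue above, as far as I can tell from the standard library: `Timeout::Error` covers `Net::OpenTimeout` and `Net::ReadTimeout`, but the `Errno::ETIMEDOUT` in the question's backtrace is raised by the operating system's socket layer and is not a `Timeout::Error`, so this rescue would probably not catch it. A small sketch to check that (the helper name `caught_by_timeout_rescue?` is made up for illustration, it is not part of Mechanize):

```ruby
require 'net/http'
require 'timeout'

# Hypothetical helper: would `rescue Timeout::Error` catch an exception
# of this class? True only if the class descends from Timeout::Error.
def caught_by_timeout_rescue?(error_class)
  error_class.ancestors.include?(Timeout::Error)
end

puts caught_by_timeout_rescue?(Net::ReadTimeout)  # true
puts caught_by_timeout_rescue?(Net::OpenTimeout)  # true
puts caught_by_timeout_rescue?(Errno::ETIMEDOUT)  # false
```

If you also want to handle the connect failure from the question, adding `Errno::ETIMEDOUT` to the rescue list is one option.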

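While it's true that mechanize doesn't support JavaScript, your problem is that you are trying to access a site that doesn't exist. You are trying to access www.imbd.com instead of www.imdb.com. So the error message is accurate.

And FWIW, IMDb doesn't want you scraping their site: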
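
A quick way to rule Mechanize in or out for this kind of error is to try a plain TCP connection to the host first. This is only a rough sketch using Ruby's standard `socket` library; `check_host` is a hypothetical helper name and the 5 second default is an arbitrary choice:

```ruby
require 'socket'

# Hypothetical helper: returns :ok if a TCP connection to host:port can be
# opened within `timeout` seconds, otherwise a short description of why not.
def check_host(host, port, timeout: 5)
  # Socket.tcp returns the block's value and closes the socket afterwards.
  Socket.tcp(host, port, connect_timeout: timeout) { :ok }
rescue SocketError => e
  # Raised when the host name cannot be resolved.
  "name lookup failed: #{e.message}"
rescue SystemCallError => e
  # Covers the Errno::* family, e.g. ETIMEDOUT or ECONNREFUSED.
  "connection failed: #{e.class}"
end
```

Calling `check_host('imbd.com', 80)` and `check_host('www.imdb.com', 443)` side by side should show whether the host name itself is the problem, before any scraping code is involved.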

Robots and Screen Scraping: You may not use data mining, robots, screen scraping, or similar data gathering and extraction tools on this site, except with our express written consent as noted below.