Mechanize is not loading the full webpage after search submit

Why don't I get any products on [this][1] page when I search for a space (" ") in the search form? I only get the menu, not the product search results.

Ruby code:

require 'nokogiri'
require 'mysql2'
require 'logger'
require 'mechanize'
agent = Mechanize.new{|a| a.log = Logger.new(STDERR) }
agent.user_agent_alias = 'Windows Mozilla'
agent.read_timeout = 60
# Parse a raw cookie string and add each resulting cookie to the agent's jar.
def add_cookie(agent, uri, cookie)
  uri = URI.parse(uri)
  Mechanize::Cookie.parse(uri, cookie) do |parsed|
    agent.cookie_jar.add(uri, parsed)
  end
end
login_page = agent.get "http://www.example.com.mx/login.php?location=%2F"
login_form = login_page.form_with(:method => 'POST')
email_field = login_form.field_with(name: "correo_ingresar") 
password_field = login_form.field_with(name: "password") 
email_field.value = 'user@example.com'
password_field.value = 'password'
home_page = login_form.submit
myarray = home_page.body.scan(/SetCookie\(\"(.+)\", \"(.+)\"\)/)
myarray.each{|line| add_cookie agent, 'http://www.example.com.mx', "#{line[0]}=#{line[1]}"}
add_cookie(agent, 'http://www.example.com.mx', "forzar_existencias=1; path=/; domain=www.example.com.mx")
add_cookie(agent, 'http://www.example.com.mx', "articulos_mostrar=50; path=/; domain=www.example.com.mx")
add_cookie(agent, 'http://www.example.com.mx', "forz_existencias=1=; path=/; domain=www.example.com.mx")
add_cookie(agent, 'http://www.example.com.mx', "no_actualiza=1; path=/; domain=www.example.com.mx")
add_cookie(agent, 'http://www.example.com.mx', "orden_mostrar=8; path=/; domain=www.example.com.mx")
add_cookie(agent, 'http://www.example.com.mx', "page=1; path=/; domain=www.example.com.mx")
add_cookie(agent, 'http://www.example.com.mx', "precio_inicio=0; path=/; domain=www.example.com.mx")
add_cookie(agent, 'http://www.example.com.mx', "location=%2Farticulos.php%3Fbuscar%3D%2B; path=/; domain=www.example.com.mx")

search_form = home_page.forms.first
search_field = search_form.field_with(name: "buscar") 
search_field.value = ' '
search_results = search_form.submit
resultados = 'http://example.com.mx/articulos.php?buscar=+'

I installed the Live HTTP Headers add-on for Firefox alongside Firebug. When I enter a space and click the search button on the [webpage][1], I get the following output in Live HTTP Headers:

http://example.com.mx/articulos.php?buscar=+

GET /articulos.php?buscar=+ HTTP/1.1
Host: example.com.mx
User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64; rv:39.0) Gecko/20100101 Firefox/39.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
Referer: http://example.com.mx/articulos.php?buscar=+
Cookie: _ga=GA1.3.162897808.1438611502; _gat=1
Connection: keep-alive

HTTP/1.1 200 OK
Date: Sat, 08 Aug 2015 04:29:40 GMT
Server: Apache
x-powered-by: PHP/5.4.30
Cache-Control: no-cache, no-store, must-revalidate
Pragma: no-cache
Expires: 0
Keep-Alive: timeout=5, max=100
Connection: Keep-Alive
Transfer-Encoding: chunked
Content-Type: text/html
----------------------------------------------------------
http://www.google-analytics.com/collect?v=1&_v=j37&a=1988602157&t=pageview&_s=1&dl=http%3A%2F%2Fexample.com.mx%2Farticulos.php%3Fbuscar%3D%2B&ul=en-us&de=UTF-8&dt=Sistemas%20Aplicados&sd=24-bit&sr=1920x1080&vp=1903x969&je=0&_u=AACAAEABI~&jid=&cid=162897808.1438611502&tid=UA-58813310-1&z=90642832

GET /collect?v=1&_v=j37&a=1988602157&t=pageview&_s=1&dl=http%3A%2F%2Fexample.com.mx%2Farticulos.php%3Fbuscar%3D%2B&ul=en-us&de=UTF-8&dt=Sistemas%20Aplicados&sd=24-bit&sr=1920x1080&vp=1903x969&je=0&_u=AACAAEABI~&jid=&cid=162897808.1438611502&tid=UA-58813310-1&z=90642832 HTTP/1.1
Host: www.google-analytics.com
User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64; rv:39.0) Gecko/20100101 Firefox/39.0
Accept: image/png,image/*;q=0.8,*/*;q=0.5
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
Referer: http://example.com.mx/articulos.php?buscar=+
Connection: keep-alive

HTTP/1.1 200 OK
Pragma: no-cache
Expires: Mon, 07 Aug 1995 23:30:00 GMT
Access-Control-Allow-Origin: *
Last-Modified: Sun, 17 May 1998 03:00:00 GMT
x-content-type-options: nosniff
Content-Type: image/gif
Date: Wed, 29 Jul 2015 12:33:33 GMT
Server: Golfe2
Content-Length: 35
Age: 834969
Alternate-Protocol: 80:quic,p=0
Cache-Control: private, no-cache, no-cache=Set-Cookie, proxy-revalidate
----------------------------------------------------------
http://example.com.mx/resultados.php

POST /resultados.php HTTP/1.1
Host: example.com.mx
User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64; rv:39.0) Gecko/20100101 Firefox/39.0
Accept: */*
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
Content-Type: application/x-www-form-urlencoded; charset=UTF-8
X-Requested-With: XMLHttpRequest
Referer: http://example.com.mx/articulos.php?buscar=+
Content-Length: 204
Cookie: _ga=GA1.3.162897808.1438611502; _gat=1
Connection: keep-alive
Pragma: no-cache
Cache-Control: no-cache

opcion=&buscar=+&page=1&articulos_mostrar=10&orden_mostrar=1&seccion=&linea=&sublinea=&forz_existencias=1&precio_inicio=0&precio_final=20000&location=%252Farticulos.php%253Fbuscar%253D%252B&no_actualiza=1

HTTP/1.1 200 OK
Date: Sat, 08 Aug 2015 04:29:42 GMT
Server: Apache
x-powered-by: PHP/5.4.30
Keep-Alive: timeout=5, max=99
Connection: Keep-Alive
Transfer-Encoding: chunked
Content-Type: text/html
----------------------------------------------------------

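The question is: how do I get the full list of products to appear on the page so that I can start scraping, given that the page has a referring link and does not fetch the products automatically? [This][2] is the resulting HTML: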

I'll give 2 solutions, but only one of them uses POST, as you asked for in your question:

require 'mechanize'

agent = Mechanize.new
agent.get("http://www.sistemasaplicados.com.mx/")
agent.page.forms.first.field_with(name: "buscar").value = ' '
result_page = agent.page.forms.first.submit

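Another option is to encode your search term and use nokogiri directly with a simple GET request (with the term encoded in the URL). In your specific case, searching for "160GB" leads to the URL http://www.sistemasaplicados.com.mx/articulos.php?buscar=160GB, which you can GET.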
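
A minimal sketch of that GET approach, assuming the search page lives at `articulos.php` on the host used above and takes the term in the `buscar` parameter. The helper names `search_url` and `fetch_search_page` are made up for illustration, and nothing here is confirmed against the site beyond the URL shape shown in the question:

```ruby
require 'uri'

# Assumption: same host as in the Mechanize example above.
BASE_URL = 'http://www.sistemasaplicados.com.mx'

# Hypothetical helper: build the search URL for a term.
# URI.encode_www_form takes care of the encoding (a space becomes "+").
def search_url(term, base: BASE_URL)
  "#{base}/articulos.php?#{URI.encode_www_form(buscar: term)}"
end

# Hypothetical helper: fetch the page and parse it with nokogiri.
# The requires are kept local so that building URLs needs no extra gems.
def fetch_search_page(term)
  require 'open-uri'
  require 'nokogiri'
  Nokogiri::HTML(URI.open(search_url(term)))
end

# Usage (performs a real HTTP request):
#   doc = fetch_search_page('160GB')
#   puts doc.title
```

Note that a plain GET returns only what the server renders directly. If the product list is filled in by JavaScript afterwards, it will not be in the parsed document.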

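By the way, you don't necessarily need mechanize for all of this, unless you want to place orders on your account automatically or something similar. I assume you are doing this for the benefit of sistemasaplicados; otherwise I would consider it rude, and it will bring you bad karma.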

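Update: When you check manually what is going on, look at what happens when JavaScript is disabled (in this case: no results). Then use your browser's "inspector", "console" or "developer tools" (usually opened by pressing F12) to see what is happening. In your case, a POST request to resultados.php is made. I found this in Firefox, in the developer tools, on the "Network" tab. You can also find the relevant parameters of the POST request there.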
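
To make that concrete, here is a rough sketch that replays the POST from Ruby. The parameter names and values are copied from the request body captured in the question. Everything else is an assumption: that `resultados.php` sits on the same host as above, that the server expects the `X-Requested-With` and `Referer` headers, that `location` is simply the encoded search path, and the helper names `search_params` and `fetch_results`, which are made up for illustration.

```ruby
require 'uri'

# Assumption: the AJAX endpoint lives on the same host as the search page.
RESULTS_URL = 'http://www.sistemasaplicados.com.mx/resultados.php'
SEARCH_PAGE = 'http://www.sistemasaplicados.com.mx/articulos.php'

# Hypothetical helper: rebuild the form fields seen in the captured POST body.
def search_params(term)
  query = URI.encode_www_form(buscar: term)
  {
    'opcion'            => '',
    'buscar'            => term,
    'page'              => '1',
    'articulos_mostrar' => '10',
    'orden_mostrar'     => '1',
    'seccion'           => '',
    'linea'             => '',
    'sublinea'          => '',
    'forz_existencias'  => '1',
    'precio_inicio'     => '0',
    'precio_final'      => '20000',
    # In the capture this value is itself URL-encoded before being sent,
    # which is why it appears double-encoded (%252F...) on the wire.
    'location'          => URI.encode_www_form_component("/articulos.php?#{query}"),
    'no_actualiza'      => '1'
  }
end

# Hypothetical helper: send the POST the way the browser's XMLHttpRequest does.
def fetch_results(term)
  require 'mechanize'
  agent = Mechanize.new
  headers = {
    'X-Requested-With' => 'XMLHttpRequest',
    'Referer'          => "#{SEARCH_PAGE}?#{URI.encode_www_form(buscar: term)}"
  }
  agent.post(RESULTS_URL, search_params(term), headers)
end

# Usage (performs a real HTTP request):
#   page = fetch_results(' ')
#   puts page.body
```

If the response is an HTML fragment with the product list, it can then be parsed with nokogiri as usual. Values such as `articulos_mostrar` and `page` are probably what you would change to page through the results, but that is a guess based on their names.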