How to parse forms with Mechanize or Nokogiri from a string

I need to parse the form to get the value of `IW_SessionID_` from the HTML I get back, but I can't get it to work.

#!/usr/bin/ruby

require 'pp'
require 'nokogiri'
require 'mechanize'

r = '<HTML><HEAD><TITLE></TITLE><meta http-equiv=\"cache-control\" content=\"no-cache\">\r\n<meta http-equiv=\"pragma\" content=\"no-cache\">\r\n<NOSCRIPT><HTML><BODY>Your browser does not seem to support JavaScript. Please make sure it is supported and activated</BODY></HTML></NOSCRIPT>\r\n<SCRIPT>\r\nvar ie4 = (document.all)? true:false;\r\nvar ns6 = (document.getElementById)? true && !ie4:false;\r\nfunction Initialize() {\r\nvar lWidth;\r\nvar lHeight;\r\nif (ns6) {\r\n  lWidth = window.innerWidth - 30;\r\n  lHeight = window.innerHeight - 30;\r\n} else {\r\n   lWidth = document.body.clientWidth;\r\n   lHeight = document.body.clientHeight;\r\n   if (lWidth == 0) { lWidth = undefined;}\r\n   if (lHeight == 0) { lHeight = undefined;}\r\n}\r\ndocument.forms[0].elements[\"IW_width\"].value = lWidth;\r\ndocument.forms[0].elements[\"IW_height\"].value = lHeight;\r\ndocument.forms[0].submit();\r\n}</SCRIPT></HEAD><BODY onload=\"Initialize()\">\r\n<form method=post action=\"/bwtem\">\r\n<input type=hidden name=\"IW_width\">\r\n<input type=hidden name=\"IW_height\">\r\n<input type=hidden name=\"IW_SessionID_\" value=\"1wqzj1f0vec57r1apfqg51wzs88c\">\r\n<input type=hidden name=\"IW_TrackID_\" value=\"0\">\r\n</form></BODY></HTML>'

page = Nokogiri::HTML r
puts page.css('form[name="IW_SessionID_"]')

a = Mechanize.new
page2 = Mechanize::Page.new(nil,{'content-type'=>'text/html'},r,nil,a)

pp page2.form_with(:name => "IW_SessionID_")

The script just returns nil.

Does anyone know how to get the value of `IW_SessionID_`?

You have to unescape the example HTML string and then search for the `input` with the name `IW_SessionID_`.

This example code works for me:

#!/usr/bin/ruby

require 'pp'
require 'nokogiri'
require 'mechanize'

r = '<HTML><HEAD><TITLE></TITLE><meta http-equiv="cache-control" content="no-cache">\r\n<meta http-equiv="pragma" content="no-cache">\r\n<NOSCRIPT><HTML><BODY>Your browser does not seem to support JavaScript. Please make sure it is supported and activated</BODY></HTML></NOSCRIPT>\r\n<SCRIPT>\r\nvar ie4 = (document.all)? true:false;\r\nvar ns6 = (document.getElementById)? true && !ie4:false;\r\nfunction Initialize() {\r\nvar lWidth;\r\nvar lHeight;\r\nif (ns6) {\r\n  lWidth = window.innerWidth - 30;\r\n  lHeight = window.innerHeight - 30;\r\n} else {\r\n   lWidth = document.body.clientWidth;\r\n   lHeight = document.body.clientHeight;\r\n   if (lWidth == 0) { lWidth = undefined;}\r\n   if (lHeight == 0) { lHeight = undefined;}\r\n}\r\ndocument.forms[0].elements["IW_width"].value = lWidth;\r\ndocument.forms[0].elements["IW_height"].value = lHeight;\r\ndocument.forms[0].submit();\r\n}</SCRIPT></HEAD><BODY onload="Initialize()">\r\n<form method=post action="/bwtem">\r\n<input type=hidden name="IW_width">\r\n<input type=hidden name="IW_height">\r\n<input type=hidden name="IW_SessionID_" value="1wqzj1f0vec57r1apfqg51wzs88c">\r\n<input type=hidden name="IW_TrackID_" value="0">\r\n</form></BODY></HTML>'

page = Nokogiri::HTML r
input = page.css('input[name="IW_SessionID_"]').first
puts input[:value]
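If the escaped HTML comes from somewhere else and can't simply be retyped, the unescaping step can also be done in code. This is only a minimal sketch: the shortened `escaped` string and the `abc123` value are made up for illustration, and it assumes the only escaping present is a backslash in front of each double quote, as in the question.

```ruby
# Inside a single-quoted Ruby string, \" stays as two characters
# (a backslash followed by a quote), so the backslashes end up in the HTML.
escaped = '<input type=hidden name=\"IW_SessionID_\" value=\"abc123\">'

# Replace every backslash-quote pair with a plain double quote.
unescaped = escaped.gsub('\"', '"')

puts unescaped
# prints: <input type=hidden name="IW_SessionID_" value="abc123">
```

After this, `unescaped` can be passed to `Nokogiri::HTML` the same way as the hand-edited string above.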

It's easy once you're familiar with the tools:

require 'nokogiri'

doc = Nokogiri::HTML(DATA.read)

doc.at('input[name="IW_SessionID_"]')['value']
# => "1wqzj1f0vec57r1apfqg51wzs88c"

__END__
<HTML>
  <BODY>
    <form method=post action="/bwtem">
      <input type=hidden name="IW_height">
      <input type=hidden name="IW_SessionID_" value="1wqzj1f0vec57r1apfqg51wzs88c">
      <input type=hidden name="IW_TrackID_" value="0">
    </form>
  </BODY>
</HTML>

Don't do things like this:

page.css('form[name="IW_SessionID_"]')

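`css` is for finding multiple elements that match a selector. A form is unlikely to have multiple hidden inputs with the same name, so `at` would be more sensible. `css` returns a NodeSet, which is akin to an array of nodes, so it doesn't behave like a single Node: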

require 'nokogiri'

doc = Nokogiri::HTML(<<EOT)
<html>
  <body>
    <p>foo</p>
    <p>bar</p>
  </body>
</html>
EOT

doc.search('p').class # => Nokogiri::XML::NodeSet
doc.at('p').class # => Nokogiri::XML::Element

`text` will concatenate the text of all the nodes in a NodeSet, resulting in a mess:

doc.search('p').text # => "foobar"

whereas using `map(&:text)` iterates over the nodes and returns the text of each one:

doc.search('p').map(&:text) # => ["foo", "bar"]

Also note that `css(...).first` or `search(...).first` is the same as `at` or one of its `at_*` siblings:

doc.search('p').first.to_html # => "<p>foo</p>"
doc.at('p').to_html # => "<p>foo</p>"

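So, for clarity, use `at` instead of `search(...).first`.

Finally, strip your HTML example down to the bare minimum needed to demonstrate the problem you're asking about. Anything beyond that wastes space and the time we spend trying to understand the problem.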