Searching by text with Mechanize/Nokogiri

Question

I'm trying to scrape data on average GPA, among other things, from a number of pages similar to this one:

http://www.ptcas.org/ptcas/public/Listing.aspx?seqn=3200&navid=10737426783

require 'mechanize'

agent = Mechanize.new
page = agent.get('http://www.ptcas.org/ptcas/public/Listing.aspx?seqn=3200&navid=10737426783')
gpa_headers = page.xpath('//h3[contains(text(), "GPA")]')
pp gpa_headers

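My problem is that gpa_headers is nil, even though at least one h3 element contains "GPA".

What is causing this? I thought it might be that the page has dynamic elements, which Mechanize has trouble with, but I can puts page.body and the output includes: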

... <h3 style="text-align:center;">GPA REQUIREMENT</h3> ...

As I understand it, the XPath I'm using should find that element.

If there is a better way to do this, I'd like to know that too.

Answer 1

This looks like a problem with the site's DOM structure: it contains a tag named style that is never closed, like this:

<td colspan='7'><style='text-align:center;font-style:italic'>The
institution has been granted Candidate for Accreditation status by the
Commission on Accreditation in Physical Therapy Education (1111 North
Fairfax Street, Alexandria, VA, 22314; phone: 703.706.3245; email: <a
href='mailto:accreditation@apta.org'>accreditation@apta.org</a>).
Candidacy is not an accreditation status nor does it assure eventual
accreditation. Candidate for Accreditation is a pre-accreditation
status of affiliation with the Commission on Accreditation in Physical
Therapy Education that indicates the program is progressing toward
accreditation.</td>

As you can see, the td tag is closed, but the style tag inside it never is.
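A rough way to confirm this kind of imbalance in the raw body, before involving any parser, is to count the textual openings and closings of a tag. The sketch below is only an illustration: tag_balance is a hypothetical helper, the sample string is a shortened stand-in for the real page, and counting with a regex is a crude heuristic rather than real HTML parsing.

```ruby
# Hypothetical helper: count textual openings and closings of a tag name.
# This is a heuristic on raw text, not a parser.
def tag_balance(html, name)
  escaped = Regexp.escape(name)
  opens   = html.scan(/<#{escaped}\b/i).size
  closes  = html.scan(/<\/#{escaped}\s*>/i).size
  { opens: opens, closes: closes }
end

# Shortened stand-in for the fragment quoted above (assumed, not the full page).
sample = "<td colspan='7'><style='text-align:center'>Candidate status text.</td>" \
         '<h3 style="text-align:center;">GPA REQUIREMENT</h3>'

style_balance = tag_balance(sample, 'style')
td_balance    = tag_balance(sample, 'td')

puts "style: #{style_balance[:opens]} opened, #{style_balance[:closes]} closed"
puts "td:    #{td_balance[:opens]} opened, #{td_balance[:closes]} closed"
```

On the real page you would pass page.body instead of the sample string. A mismatch for style is a hint that the parser may be treating everything after the opening as stylesheet text.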

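If you don't need this part of the markup, I recommend removing it before you try to use the whole response. I have no experience with Ruby, but I would do something like this:

1. Get the raw body of the response.
2. Replace the part that matches the regular expression '(<style=\'.*)</td>' with an empty string, or close the tag yourself.
3. Use this new response body.

Now you can use the XPath selector.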
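The replacement step can be tried offline on a small string before running it against the live page. The following is a minimal sketch under stated assumptions: SAMPLE_HTML is a made-up stand-in for the response body, strip_broken_style is a hypothetical helper name, and the /m flag is added on the assumption that the broken fragment may span several lines. It also contrasts a lazy quantifier with a greedy one, since a greedy match can swallow the very header you are looking for.

```ruby
# Made-up stand-in for the raw response body (not the real page).
SAMPLE_HTML = <<~HTML
  <table><tr>
  <td colspan='7'><style='text-align:center;font-style:italic'>The
  institution has been granted Candidate for Accreditation status.</td>
  </tr></table>
  <h3 style="text-align:center;">GPA REQUIREMENT</h3>
  <table><tr><td>Minimum GPA</td></tr></table>
HTML

# Hypothetical helper: drop the unclosed <style=...> pseudo-tag and its text,
# keeping the closing </td>. The lazy ".+?" stops at the first </td>;
# the /m flag lets "." match line breaks.
def strip_broken_style(html)
  html.sub(/<style=.+?<\/td>/m, '</td>')
end

fixed = strip_broken_style(SAMPLE_HTML)

# For contrast: a greedy ".*" with /m runs to the LAST </td> in the document
# and removes the h3 along the way.
greedy = SAMPLE_HTML.sub(/<style=.*<\/td>/m, '</td>')

puts fixed
puts "header kept by lazy match:   #{fixed.include?('GPA REQUIREMENT')}"
puts "header kept by greedy match: #{greedy.include?('GPA REQUIREMENT')}"
```

The cleaned string can then be handed to an HTML parser in place of the original body.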

Answer 2

eLRuLL identified the source of the problem above. Here is an example of how I solved it:

require 'mechanize'
require 'nokogiri'

agent = Mechanize.new
page = agent.get('http://www.ptcas.org/ptcas/public/Listing.aspx?seqn=3200&navid=10737426783')
mangled_text = page.body
fixed_text = mangled_text.sub(/<style=.+?<\/td>/, "</td>")
page = Nokogiri::HTML(fixed_text)
gpa_headers = page.xpath('//h3[contains(text(), "GPA")]')
pp gpa_headers

This returns the header I was looking for above:

[#<Nokogiri::XML::Element:0x2b28a8ec0c38 name="h3" attributes=[#<Nokogiri::XML::Attr:0x2b28a8ec0bc0 name="style" value="text-align:center;">] children=[#<Nokogiri::XML::Text:0x2b28a8ec0774 "GPA REQUIREMENT">]>]

Answer 3

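A more robust solution is to use an HTML5 parser such as nokogumbo:

# With Nokogiri 1.12 or later, Nokogiri::HTML5 is built in and require 'nokogiri' is enough.
require 'nokogumbo'
# page is the Mechanize::Page returned by agent.get in the examples above
doc = Nokogiri::HTML5(page.body)
gpa_headers = doc.search('//h3[contains(text(), "GPA")]')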
