How to use Mechanize to parse a local file

Question

I am using Ruby and Mechanize to parse a local HTML file, but I can't get it to work. It works if I use a URL:

agent = Mechanize.new
#THIS WORKS
#url = 'http://www.sample.com/sample.htm'
#page = agent.get(url) #this seems to work just fine but the following below doesn't

#THIS FAILS
file = File.read('/home/user/files/sample.htm') #this is a regular html file
page = Nokogiri::HTML(file)
pp page.body #errors here

page.search('/div[@class="product_name"]').each do |node|
  text = node.text  
  puts "product name: " + text.to_s
end

The error is:

/home/user/code/myapp/app/models/program.rb:35:in `main': undefined method `body' for #<Nokogiri::HTML::Document:0x000000011552b0> (NoMethodError)

How do I get a page object so that I can search on it?

Answer 1

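Mechanize uses URI strings to point to what it is supposed to parse. Normally we would use an "http" or "https" scheme to point to a web server, and that is where Mechanize's strengths are, but other schemes are available, including "file", which can be used to load a local file.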

I have a little HTML file on my Desktop called "test.html":

<!DOCTYPE html>
<html>
<head></head>
<body>
<p>
Hello World!
</p>
</body>
</html>

Running this code:

require 'mechanize'

agent = Mechanize.new
page = agent.get('file:/Users/ttm/Desktop/test.html')
puts page.body

Outputs:

<!DOCTYPE html>
<html>
<head></head>
<body>
<p>
Hello World!
</p>
</body>
</html>

This tells me Mechanize loaded the file, parsed it, then accessed the body.
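To load the file from the question the same way, its path first has to be turned into a "file" URI. The following is a minimal sketch using only the standard library. `file_uri` is a hypothetical helper, not part of Mechanize, and it assumes a Unix-style path with no characters that need percent-encoding (spaces, for example):

```ruby
# Hypothetical helper: turn a local path into a file: URI string.
# Assumes a Unix-style path without spaces or other characters
# that would need percent-encoding.
def file_uri(path)
  "file://#{File.expand_path(path)}"
end

puts file_uri('/home/user/files/sample.htm')
# prints: file:///home/user/files/sample.htm

# With the mechanize gem installed, the string can be passed to get:
#   page = Mechanize.new.get(file_uri('/home/user/files/sample.htm'))
#   page.search('div.product_name').each { |node| puts node.text }
```

Expanding the path first means a relative path such as `sample.htm` is resolved against the current working directory before it is placed in the URI.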

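However, unless you need to actually manipulate forms and/or navigate pages, Mechanize is probably not what you want to use. Instead, Nokogiri, which sits underneath Mechanize, is a better choice for parsing, extracting data, or manipulating the markup, and it is agnostic as to what scheme was used or where the file is actually located: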

require 'nokogiri'

doc = Nokogiri::HTML(File.read('/Users/ttm/Desktop/test.html'))
puts doc.to_html

which outputs the same file after it has been parsed.

Coming back to your question, here is how to find the node using only Nokogiri:

Change test.html to:

<!DOCTYPE html>
<html>
<head></head>
<body>
<div class="product_name">Hello World!</div>
</body>
</html>

And run:

require 'nokogiri'

doc = Nokogiri::HTML(File.read('/Users/ttm/Desktop/test.html'))
doc.search('div.product_name').map(&:text)
# => ["Hello World!"]

This shows that Nokogiri found the node and returned the text.

This bit of code in your example could be better:

text = node.text  
puts "product name: " + text.to_s

node.text returns a String:

doc = Nokogiri::HTML('<p>hello world!</p>')
doc.at('p').text # => "hello world!"
doc.at('p').text.class # => String

So text.to_s is redundant. Simply use text.
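One way to tidy up that line is string interpolation, which is a stylistic suggestion rather than a requirement. In this small sketch, `product_line` is a hypothetical helper:

```ruby
# Hypothetical helper: format one output line for a product name.
# Interpolation converts its argument to a String implicitly,
# so no explicit to_s call is needed.
def product_line(text)
  "product name: #{text}"
end

puts product_line('Hello World!')
# prints: product name: Hello World!
```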
