How do I prevent Nokogiri from adding unnecessary HTML tags?
I'm using Rails 4.2.3. I have this code to fetch the contents of a URL:
doc = Nokogiri::HTML(open(url))
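Sometimes the URL returns HTML and sometimes it returns JSON; I don't know which in advance. I've noticed that when the URL returns JSON, Nokogiri wraps it in a set of HTML tags. This is what is displayed in the browser: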
{"list":[{"u":"1459808276_000001","i":"1459184695_000001","pid":"RDE8UZZZ","fname":"Alexi","lname":"Jones","sex":"F","city":"Eugene","country":"US","country_iso":"us","course":"8k","class":"elite","race":"8K","name":"Alexi Jones","_ver":"14","tag":"0000001","bib":"1"}],"info":{"first":"1","last":"1","total":"1","cacheVer":"0~0"}}
However, when I run it through Nokogiri, this is what is returned:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><p>{"list":[{"u":"1459808276_000001","i":"1459184695_000001","pid":"RDE8UZZZ","fname":"Alexi","lname":"Jones","sex":"F","city":"Eugene","country":"US","country_iso":"us","course":"8k","class":"elite","race":"8K","name":"Alexi Jones","_ver":"14","tag":"0000001","bib":"1"}],"info":{"first":"1","last":"1","total":"1","cacheVer":"0~0"}}</p></body></html>
How do I prevent Nokogiri from adding the extra markup? I just want it to return exactly what is sent to the browser.
When I tried doc = Nokogiri::HTML.fragment(open(url)), as suggested in another SO answer, I got this error:
error: undefined method `strip' for #<StringIO:0x007ff8acb34c30>
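Nokogiri assumes you have already determined that the content you are receiving is something it should parse. It's up to you to check that before passing the content to Nokogiri.
Instead of using
doc = Nokogiri::HTML(open(url))
you can look at the "Content-Type" HTTP header that is returned. A JSON response should be "application/json" and an HTML response should be "text/html". The OpenURI documentation has this example: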
open("http://www.ruby-lang.org/en") {|f|
  f.each_line {|line| p line}
  p f.base_uri         # <URI::HTTP:0x40e6ef2 URL:http://www.ruby-lang.org/en/>
  p f.content_type     # "text/html"
  p f.charset          # "iso-8859-1"
  p f.content_encoding # []
  p f.last_modified    # Thu Dec 05 02:45:02 UTC 2002
}
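As a rough illustration of the header check, the sketch below picks a parser from the content type. The helper names response_kind and parse_body are hypothetical, not part of OpenURI or Nokogiri, and it assumes the server sends an accurate Content-Type header, which is not always the case.

```ruby
require 'json'

# Hypothetical helper: map a Content-Type value to a rough category.
# Any parameters such as "; charset=utf-8" are dropped before matching.
def response_kind(content_type)
  media_type = content_type.to_s.split(';').first.to_s.strip.downcase
  case media_type
  when %r{\Aapplication/(?:.+\+)?json\z} then :json
  when 'text/html', 'application/xhtml+xml' then :html
  else :unknown
  end
end

# Hypothetical helper: parse the body according to its content type.
def parse_body(body, content_type)
  case response_kind(content_type)
  when :json
    JSON.parse(body)
  when :html
    require 'nokogiri' # loaded only when there is HTML to parse
    Nokogiri::HTML(body)
  else
    body # unrecognized type: hand back the raw string untouched
  end
end
```

With OpenURI this might be called inside the block as parse_body(f.read, f.content_type). Returning the raw body for unrecognized types is just one possible choice; raising an error would be equally reasonable.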
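Alternatively, you can look at the first character of the returned body, which will tell you whether it's HTML/XML or JSON. The first two will start with <, while JSON will start with [ or {.
Something like this would be a good start: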
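require 'open-uri'

# On Ruby 2.5 and later use URI.open; Kernel#open stopped accepting URLs in Ruby 3.0.
content = open('http://www.example.com').read
if content.lstrip[0] == '<'
  # it's XML/HTML so parse it with Nokogiri
else
  # it's JSON so parse it with the JSON parser
end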
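A slightly stricter variant of the same idea is sketched below. The helper name sniff_format is hypothetical, and unlike the snippet above it does not assume that everything which isn't markup must be JSON; anything else is reported as unknown so the caller can decide what to do with it.

```ruby
# Hypothetical helper: guess the format of a response body from its
# first non-whitespace character.
def sniff_format(content)
  case content.to_s.lstrip[0]
  when '<' then :markup       # HTML or XML
  when '{', '[' then :json    # JSON object or array
  else :unknown               # empty body, plain text, a bare JSON scalar, etc.
  end
end
```

A body that begins with a byte order mark, or a JSON document that is just a bare string or number, would fall into the unknown branch, so treat this as a heuristic rather than a guarantee.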