How to avoid "Invalid byte sequence" when looking for link with text using Nokogiri
I'm using Rails 5 with Ruby 2.4 and scanning a document I parsed with Nokogiri, looking case-insensitively for a link that contains certain text:
a_elt = doc ? doc.xpath('//a').detect { |node| /link[[:space:]]+text/i === node.text } : nil
After fetching my web page's HTML into content, I parse it into a Nokogiri document with:
doc = Nokogiri::HTML(content)
The problem is that I get "ArgumentError: invalid byte sequence in UTF-8" when I apply the regular expression above to certain web pages:
2.4.0 :002 > doc.encoding
=> "UTF-8"
2.4.0 :003 > doc.xpath('//a').detect { |node| /individual[[:space:]]+results/i === node.text }
ArgumentError: invalid byte sequence in UTF-8
from (irb):3:in `==='
from (irb):3:in `block in irb_binding'
from /Users/davea/.rvm/gems/ruby-2.4.0@global/gems/nokogiri-1.7.0/lib/nokogiri/xml/node_set.rb:187:in `block in each'
from /Users/davea/.rvm/gems/ruby-2.4.0@global/gems/nokogiri-1.7.0/lib/nokogiri/xml/node_set.rb:186:in `upto'
from /Users/davea/.rvm/gems/ruby-2.4.0@global/gems/nokogiri-1.7.0/lib/nokogiri/xml/node_set.rb:186:in `each'
from (irb):3:in `detect'
from (irb):3
from /Users/davea/.rvm/gems/ruby-2.4.0@global/gems/railties-5.0.1/lib/rails/commands/console.rb:65:in `start'
from /Users/davea/.rvm/gems/ruby-2.4.0@global/gems/railties-5.0.1/lib/rails/commands/console_helper.rb:9:in `start'
from /Users/davea/.rvm/gems/ruby-2.4.0@global/gems/railties-5.0.1/lib/rails/commands/commands_tasks.rb:78:in `console'
from /Users/davea/.rvm/gems/ruby-2.4.0@global/gems/railties-5.0.1/lib/rails/commands/commands_tasks.rb:49:in `run_command!'
from /Users/davea/.rvm/gems/ruby-2.4.0@global/gems/railties-5.0.1/lib/rails/commands.rb:18:in `<top (required)>'
from bin/rails:4:in `require'
from bin/rails:4:in `<main>'
Is there a way to rewrite the above so that it automatically accounts for the encoding or odd characters instead of blowing up?
Your question may already have been answered. Have you tried the approaches from “Is there any way to clean a file of "invalid byte sequence in UTF-8" errors in Ruby?”
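In particular, before parsing and running the detect block, try stripping invalid bytes and any control characters other than newlines. scrub! and gsub! are String methods, so call them on the HTML string (content above) rather than on the parsed doc:
content.scrub!("")
content.gsub!(/[[:cntrl:]&&[^\n\r]]/,"")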
Keep in mind that scrub! is a Ruby 2.1+ method.
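If cleaning the whole page up front is not an option, the same idea can be applied to each link's text just before matching. What follows is only a minimal sketch under that assumption: clean_text and link_text_match? are hypothetical helper names (not part of Ruby or Nokogiri), and the sample strings stand in for what node.text might return.

```ruby
# Hypothetical helper: returns a cleaned copy of +text+ that is safe to match.
def clean_text(text)
  text.scrub("")                          # drop invalid byte sequences (Ruby 2.1+)
      .gsub(/[[:cntrl:]&&[^\n\r]]/, "")   # drop control characters except \n and \r
end

# Hypothetical helper: true if the cleaned text matches +pattern+.
def link_text_match?(text, pattern)
  pattern === clean_text(text)
end

pattern = /individual[[:space:]]+results/i

# Stand-ins for node.text values; the second one contains an invalid byte
# (\xFF) and a control character (\a).
texts = ["Home", "Individual \xFFresults\a", "Contact"]

found = texts.detect { |t| link_text_match?(t, pattern) }
puts found.inspect

# With a parsed document the call would look like this (assumes doc exists):
#   doc.xpath('//a').detect { |node| link_text_match?(node.text, pattern) }
```

One thing to watch: [[:cntrl:]] also covers the tab character, so words separated only by a tab end up joined after cleaning and will no longer match [[:space:]]+. If that matters for your pages, drop the gsub step and keep only scrub.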