How to get a specific visible string on a page through Nokogiri

Question

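Currently, I can parse a website with Nokogiri and get specific elements from the page. However, I need to be able to get a specific string that is visible to the user, for example "Out of stock":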

page.text.match('Out of stock')

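This works well for finding the string and returning a truthy or falsy result depending on whether it exists. However, some links, such as the one below, return a match even when the item is not out of stock, because that string is hidden in a script tag on the page: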

https://www.walmart.com/ip/Funyuns-Onion-Flavored-Rings-6-oz/36915849?athcpid=36915849&athpgid=athenaItemPage&athcgid=null&athznid=PWSFM&athieid=v0&athstid=CS020&athguid=ba634528-888-172187cc96a580&athancid=null&athena=true

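I'm looking for a way to pull the string if and only if it is visible to the user. The link above should therefore return false when matching the "Out of stock" string, while the one below should return true (at the time of posting), because that item really is out of stock.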

https://www.walmart.com/ip/4-Pack-Chesters-Flamin-Hot-Popcorn-4-25-oz/737202470?selected=true

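I also know that I could grab the specific tag that contains the string, but I need to monitor hundreds of websites, so the solution has to be a broad search for the visible string.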

Answer 1

Short answer: we can use more specific XPath syntax.

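Long answer: I strongly recommend using CSS classes to be more specific, because in some cases this text can come not only from a script tag but also from media-query blocks, item previews, or anything else. Handle the common cases in bulk, but do not force one specific solution onto every case, or you risk unexpected behavior.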

So let's be specific and work with "target tags", for example:

Nokogiri::HTML.parse(page.html).xpath("//*[contains(@class, 'prod-PriceSection')]//*[contains(@class, 'prod-ProductOffer-oosMsg')]").text
# => "Out of stock"

So, "to monitor hundreds of websites", we can take this approach:

xpath("//*[contains(@class, 'PriceSection')]").text

Or, better still, use something like this to make sure the element is visible:

page.all(:xpath, "//body//*[contains(text(), 'Out of stock')]", visible: true).count
# => 1

If the extra request that Capybara makes in the previous solution could be a problem, we can use this solution instead, which is faster:

xpath("//body//*[not(self::script) and contains(text(), 'Out of stock')]").count

Hope this helps.

ruby

html-parsing

nokogiri

capybara