How do I grab the string on the next line in HTML code following a <span> tag with a specific class and specific text?

Question

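I'm trying to scrape product specifications from some e-commerce websites. I have a list of URLs for various products, and I need my code to visit each one (this part is easy) and scrape the product specs I need. I've been trying to use ParseHub; it works for some links but not for others. My suspicion is that, for example, 'Wheel diameter' changes its position from page to page, so the scraper ends up grabbing the wrong spec value.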

One such section, for example, looks like this in the HTML:

<div class="product-detail product-detail-custom-field">
          <span class="product-detail-key">Wheel Diameter</span>
          <span data-product-custom-field="">8 Inches</span>
        </div>

What I think I could do is use BeautifulSoup and somehow write something like

if soup.find("span", class_ = "product-detail-key").text.strip()=="Wheel Diameter":
                *go to the next line and grab the string inside*

How do I go about coding this? I'm really sorry if my question sounds silly; pardon my ignorance, I'm new to web scraping.

Answer 1

You can use the .find_next() function:

from bs4 import BeautifulSoup

html_doc = """
<div class="product-detail product-detail-custom-field">
  <span class="product-detail-key">Wheel Diameter</span>
  <span data-product-custom-field="">8 Inches</span>
</div>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# string= is the current name of the argument formerly called text= (deprecated)
diameter = soup.find("span", string="Wheel Diameter").find_next("span").text
print(diameter)

Prints:

8 Inches
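
Note that .find_next() walks forward through the rest of the document, so if the value <span> were missing it would return the next <span> found anywhere later on the page. A minimal sketch of a stricter variant, assuming the value is always a sibling of the label inside the same <div>. The helper name is_wheel_diameter_label is made up for illustration, and the extra whitespace around the label is added only to show why the text is stripped before comparing:

```python
from bs4 import BeautifulSoup

html_doc = """
<div class="product-detail product-detail-custom-field">
  <span class="product-detail-key"> Wheel Diameter </span>
  <span data-product-custom-field="">8 Inches</span>
</div>
"""

soup = BeautifulSoup(html_doc, "html.parser")


def is_wheel_diameter_label(tag):
    # Hypothetical helper: match the label span by class and stripped text.
    return (
        tag.name == "span"
        and "product-detail-key" in tag.get("class", [])
        and tag.get_text(strip=True) == "Wheel Diameter"
    )


label = soup.find(is_wheel_diameter_label)
# find_next_sibling() only looks at siblings, so it stays inside the same <div>.
value = label.find_next_sibling("span") if label else None
diameter = value.get_text(strip=True) if value else None
print(diameter)
```

If either the label or its sibling is absent, diameter ends up as None instead of raising an AttributeError.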

Or use a CSS selector with +:

diameter = soup.select_one('.product-detail-key:-soup-contains("Wheel Diameter") + *').text

Answer 2

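With CSS selectors you can simply chain/combine parts of your selection to make it stricter. In this case, you select the <span> that contains your string and use the adjacent sibling combinator to get the next sibling <span>.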

diameter = soup.select_one('.product-detail-key:-soup-contains("Wheel Diameter") + span').text

or

diameter = soup.select_one('span.product-detail-key:-soup-contains("Wheel Diameter") + span').text

Note: To avoid AttributeError: 'NoneType' object has no attribute 'text' when the element is not available, you can check whether it exists before accessing its text attribute:

diameter = e.text if (e := soup.select_one('.product-detail-key:-soup-contains("Wheel Diameter") + span')) else None
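
Since the same check has to run for every product URL, it may be convenient to wrap it in a small function. This is only a sketch: get_spec is a hypothetical name, it assumes the label contains no double quotes (they would break the selector string), and :-soup-contains does a substring match, so a shorter label such as "Wheel" would also match "Wheel Diameter".

```python
from bs4 import BeautifulSoup

html_doc = """
<div class="product-detail product-detail-custom-field">
  <span class="product-detail-key">Wheel Diameter</span>
  <span data-product-custom-field="">8 Inches</span>
</div>
"""


def get_spec(soup, label):
    # Hypothetical helper: return the value next to `label`, or None if absent.
    selector = f'.product-detail-key:-soup-contains("{label}") + span'
    element = soup.select_one(selector)
    return element.get_text(strip=True) if element is not None else None


soup = BeautifulSoup(html_doc, "html.parser")
diameter = get_spec(soup, "Wheel Diameter")
missing = get_spec(soup, "Weight")  # "Weight" is not present in this snippet
print(diameter, missing)
```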

Example

from bs4 import BeautifulSoup

html_doc = """
<div class="product-detail product-detail-custom-field">
  <span class="product-detail-key">Wheel Diameter</span>
  <span data-product-custom-field="">8 Inches</span>
</div>
"""

soup = BeautifulSoup(html_doc, "html.parser")

diameter = e.text if (e := soup.select_one('.product-detail-key:-soup-contains("Wheel Diameter") + span')) else None
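
Because the question mentions that a spec can sit at a different position on each page, another option is to collect every key/value pair into a dict and look values up by name. A rough sketch, assuming all spec rows share the markup shown in the question; the second "Color" row is invented here purely to illustrate more than one entry:

```python
from bs4 import BeautifulSoup

html_doc = """
<div class="product-detail product-detail-custom-field">
  <span class="product-detail-key">Wheel Diameter</span>
  <span data-product-custom-field="">8 Inches</span>
</div>
<div class="product-detail product-detail-custom-field">
  <span class="product-detail-key">Color</span>
  <span data-product-custom-field="">Black</span>
</div>
"""

soup = BeautifulSoup(html_doc, "html.parser")

specs = {}
for key in soup.select("div.product-detail span.product-detail-key"):
    # The value is assumed to be the next sibling <span> of each key.
    value = key.find_next_sibling("span")
    if value is not None:
        specs[key.get_text(strip=True)] = value.get_text(strip=True)

# Lookup is by name, so the order of the rows on the page no longer matters.
print(specs.get("Wheel Diameter"))
```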
