Extracting a specific part of HTML

Question

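I'm building a web scraper using requests-html and Beautiful Soup (I'm new to both). For one web page (https://www.selfridges.com/GB/en/cat/beauty/make-up/?pn=1), I'm trying to scrape one part of it, which I will then replicate for other products. The HTML looks like this: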

<div class="plp-listing-load-status c-list-header__counter initialized" data-page-number="1" data-total-pages-count="57" data-products-count="60" data-total-products-count="3361" data-status-format="{available}/{total} results">60/3361 results</div>

I want to scrape the "57" from data-total-pages-count="57". I have tried:

soup = BeautifulSoup(page.content, "html.parser")
nopagesstr = soup.find(class_="plp-listing-load-status c-list-header__counter initialized").get('data-total-pages-count')

and

nopagesstr = r.html.find('[data-total-pages-count]',first=True)

but both return None. I'm not sure how to select the 57 specifically. Any help would be appreciated.

Answer 1

To get the total number of pages, you can use this example:

import requests
from bs4 import BeautifulSoup


url = "https://www.selfridges.com/GB/en/cat/beauty/make-up/?pn=1"

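# Send a browser-like User-Agent header with the request.
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:87.0) Gecko/20100101 Firefox/87.0"
}
soup = BeautifulSoup(requests.get(url, headers=headers).text, "html.parser")

# Select the first element that carries the attribute, then read its value.
print(soup.select_one("[data-total-pages-count]")["data-total-pages-count"])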

For the page shown in the question, this prints:

57
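The same selector can be tried offline against the markup quoted in the question, which helps separate selector problems from request problems. The following is a minimal sketch, not part of the original answer: `SAMPLE_HTML` and the helper `total_pages` are hypothetical names introduced for illustration, and the markup is copied from the question, so the live page may differ. Matching on the attribute alone avoids depending on the exact class list, which may not be identical in the HTML the server returns.

```python
from bs4 import BeautifulSoup

# Markup copied from the question; the live page may serve something different.
SAMPLE_HTML = (
    '<div class="plp-listing-load-status c-list-header__counter initialized" '
    'data-page-number="1" data-total-pages-count="57" data-products-count="60" '
    'data-total-products-count="3361" '
    'data-status-format="{available}/{total} results">60/3361 results</div>'
)


def total_pages(html):
    """Return data-total-pages-count as an int, or None if no element has it."""
    soup = BeautifulSoup(html, "html.parser")
    # Attribute selector: matches any element that has this attribute,
    # regardless of its classes.
    tag = soup.select_one("[data-total-pages-count]")
    if tag is None:
        return None
    return int(tag["data-total-pages-count"])


print(total_pages(SAMPLE_HTML))              # 57
print(total_pages("<div>no counter</div>"))  # None
```

If this returns 57 for the sample but None for the downloaded page, the element is probably missing from the response itself, so the request (headers, blocking, or script-rendered content) is the next thing to check.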
