How can I scrape an href that is hidden behind a placeholder?
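I am trying to scrape the following href from a website. There are several hrefs on the site that I want to scrape, so I loop over the page to store them all in a list. Below is an example of one of the hrefs.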
<div class="col-md-4 h-gutter">
<div class="product box" data-productid="2111214">
<a href="/products/examples/product1/">
<h3>Product 1</h3>
<div class="product-small-text">
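Here is the relevant part of my code. The commented-out line is my attempt to collect the hrefs. Since that did not work, I am now trying to scrape the whole "col-md-4 h-gutter" block.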
for product in soup.select('div.product.box'):
    link.append(product)
    #link.append(product.a['href'])
print(link)
Here is what gets printed to the terminal. As you can see, the href is hidden behind a placeholder.
</div>, <div class="product placeholder-container box">
<h3><span class="placeholder-text--long"></span></h3>
<div class="product-small-text">
<span class="placeholder-text--short"></span>
</div>
How can I print out the value of the href?
Using the JSON response is much easier. If you need it in table form, just feed it into pandas:
import requests
import pandas as pd

# Request the JSON API directly instead of parsing the rendered HTML
url = 'https://www.masterofmalt.com/api/v2/lightningdeals/?isVatableCountry=1&deliveryCountryId=464&filter=nodrams&_=1617024330709&format=json'
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.190 Safari/537.36'}

# Decode the response body as JSON and load the list of deals into a DataFrame
jsonData = requests.get(url, headers=headers).json()
df = pd.DataFrame(jsonData['lightningDeals'])
Output: the first 5 of 43 rows
print(df.head(5).to_string())
productUrl productImageUrl productRating productReviewCount productVolume productAbv categories endDateUtc productId productName dealPrice previousPrice timeRemaining saving percentageClaimed isActive dailyDeal
0 /whiskies/tobermory/tobermory-12-year-old-whisky/ /whiskies/p-IMAGEPRESET/tobermory/tobermory-12-year-old-whisky.jpg 5.0 17 70 46.3 [Whiskies, Single Malt] 2021-04-04T22:57:00.0000000 87989 Tobermory 12 Year Old 34.85 39.85 550379 5.0 0.669725 True False
1 /whiskies/elements-of-islay/peat-pure-islay-elements-of-islay-whisky/ /whiskies/p-IMAGEPRESET/elements-of-islay/peat-pure-islay-elements-of-islay-whisky.jpg 0.0 0 50 45.0 [Whiskies, Blended Malt] 2021-04-04T22:59:00.0000000 58061 Peat Pure Islay 23.94 28.94 550499 5.0 0.625000 True False
2 /mezcal/ilegal/ilegal-reposado-mezcal/ /mezcal/p-IMAGEPRESET/ilegal/ilegal-reposado-mezcal.jpg 5.0 3 70 40.0 [Mezcal, Reposado] 2021-04-04T22:59:00.0000000 9277 Ilegal Reposado 53.40 59.40 550499 6.0 0.500000 True False
3 /whiskies/nikka/nikka-coffey-grain-whisky-70cl/ /whiskies/p-IMAGEPRESET/nikka/nikka-coffey-grain-whisky-70cl.jpg 4.5 40 70 45.0 [Whiskies, Grain] 2021-04-04T22:57:00.0000000 32316 Nikka Coffey Grain 70cl 49.83 54.83 550379 5.0 0.410256 True False
4 /rum/satchmo/satchmo-mojito-spirited-rum/ /rum/p-IMAGEPRESET/satchmo/satchmo-mojito-spirited-rum.jpg 5.0 14 70 37.5 [Rum, Spiced] 2021-04-04T22:58:00.0000000 106576 Satchmo Rum 34.95 39.95 550439 5.0 0.338710 True False
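To get back to the original question (the href values themselves), the productUrl column already holds the relative links. Below is a minimal sketch of turning them into a list of absolute URLs. The helper name build_links and the BASE_URL constant are my own, not part of the site's API, and I am assuming the links are relative to the same host as the API URL above. The sample records are copied from the output above and trimmed to two fields so the snippet runs without a network call.

```python
from urllib.parse import urljoin

# Assumption: product links are relative to the same host as the API URL.
BASE_URL = "https://www.masterofmalt.com"


def build_links(deals, base_url=BASE_URL):
    """Return an absolute URL for every deal that has a non-empty productUrl."""
    return [
        urljoin(base_url, deal["productUrl"])
        for deal in deals
        if deal.get("productUrl")
    ]


# Two records taken from the output above, reduced to the relevant fields.
sample_deals = [
    {
        "productUrl": "/whiskies/tobermory/tobermory-12-year-old-whisky/",
        "productName": "Tobermory 12 Year Old",
    },
    {
        "productUrl": "/rum/satchmo/satchmo-mojito-spirited-rum/",
        "productName": "Satchmo Rum",
    },
]

links = build_links(sample_deals)
for link in links:
    print(link)
```

With the live response you would presumably pass jsonData['lightningDeals'] instead of sample_deals, or take df['productUrl'].tolist() if the relative paths are all you need.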