How can I scrape an href that is hidden behind a placeholder?

Question

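I am trying to scrape the following href from a website. The site has several hrefs I want to collect, so I loop over the page to store them all in a list. Below is an example of one of the hrefs.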

<div class="col-md-4 h-gutter">
   <div class="product box" data-productid="2111214"> 
      <a href="/products/examples/product1/"> 
         <h3>Product 1</h3> 
         <div class="product-small-text">

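Here is the relevant part of my code. The commented-out line is my attempt to collect the hrefs. Since that did not work, I am now trying to scrape the whole "col-md-4 h-gutter" block: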

link = []
for product in soup.select('div.product.box'):
    link.append(product)
    # link.append(product.a['href'])  # earlier attempt at collecting the hrefs

print(link)

Here is what gets printed to the terminal. As you can see, the href is hidden behind a placeholder.

</div>, <div class="product placeholder-container box"> 
<h3><span class="placeholder-text--long"></span></h3> 
<div class="product-small-text"> 
<span class="placeholder-text--short"></span> 
</div>

How can I print the value of the href?

Answer 1

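The placeholder markup suggests that the page fills in the product data with JavaScript after it loads, so the HTML you fetch never contains the hrefs. It is much easier to use the JSON response instead. If you need the data in table form, just feed it into pandas: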

import requests
import pandas as pd

url = 'https://www.masterofmalt.com/api/v2/lightningdeals/?isVatableCountry=1&deliveryCountryId=464&filter=nodrams&_=1617024330709&format=json'
headers={'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.190 Safari/537.36'}
jsonData = requests.get(url, headers=headers).json()

df = pd.DataFrame(jsonData['lightningDeals'])

Output: the first 5 of 43 rows

print(df.head(5).to_string())
                                                              productUrl                                                                         productImageUrl  productRating  productReviewCount  productVolume  productAbv                categories                   endDateUtc  productId              productName  dealPrice  previousPrice  timeRemaining  saving  percentageClaimed  isActive  dailyDeal
0                      /whiskies/tobermory/tobermory-12-year-old-whisky/                      /whiskies/p-IMAGEPRESET/tobermory/tobermory-12-year-old-whisky.jpg            5.0                  17             70        46.3   [Whiskies, Single Malt]  2021-04-04T22:57:00.0000000      87989    Tobermory 12 Year Old      34.85          39.85         550379     5.0           0.669725      True      False
1  /whiskies/elements-of-islay/peat-pure-islay-elements-of-islay-whisky/  /whiskies/p-IMAGEPRESET/elements-of-islay/peat-pure-islay-elements-of-islay-whisky.jpg            0.0                   0             50        45.0  [Whiskies, Blended Malt]  2021-04-04T22:59:00.0000000      58061          Peat Pure Islay      23.94          28.94         550499     5.0           0.625000      True      False
2                                 /mezcal/ilegal/ilegal-reposado-mezcal/                                 /mezcal/p-IMAGEPRESET/ilegal/ilegal-reposado-mezcal.jpg            5.0                   3             70        40.0        [Mezcal, Reposado]  2021-04-04T22:59:00.0000000       9277          Ilegal Reposado      53.40          59.40         550499     6.0           0.500000      True      False
3                        /whiskies/nikka/nikka-coffey-grain-whisky-70cl/                        /whiskies/p-IMAGEPRESET/nikka/nikka-coffey-grain-whisky-70cl.jpg            4.5                  40             70        45.0         [Whiskies, Grain]  2021-04-04T22:57:00.0000000      32316  Nikka Coffey Grain 70cl      49.83          54.83         550379     5.0           0.410256      True      False
4                              /rum/satchmo/satchmo-mojito-spirited-rum/                              /rum/p-IMAGEPRESET/satchmo/satchmo-mojito-spirited-rum.jpg            5.0                  14             70        37.5             [Rum, Spiced]  2021-04-04T22:58:00.0000000     106576              Satchmo Rum      34.95          39.95         550439     5.0           0.338710      True      False
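
To get from this table back to the hrefs the question asks for, the `productUrl` values can be joined onto the site root. The following is only a minimal sketch: `BASE_URL`, `extract_links`, and `sample_deals` are names made up for illustration, the two records are copied from the output above and trimmed to the relevant keys, and it assumes the relative paths resolve against the same host as the API URL.

```python
from urllib.parse import urljoin

# Assumption: product paths are relative to the host used in the API URL above.
BASE_URL = "https://www.masterofmalt.com"


def extract_links(deals, base_url=BASE_URL):
    """Return absolute product URLs for every record that has a productUrl."""
    return [
        urljoin(base_url, deal["productUrl"])
        for deal in deals
        if deal.get("productUrl")
    ]


# Two records taken from the printed output, reduced to the keys used here.
sample_deals = [
    {
        "productName": "Tobermory 12 Year Old",
        "productUrl": "/whiskies/tobermory/tobermory-12-year-old-whisky/",
    },
    {
        "productName": "Ilegal Reposado",
        "productUrl": "/mezcal/ilegal/ilegal-reposado-mezcal/",
    },
]

links = extract_links(sample_deals)
print(links)
```

With the live response, the same helper could be called as `extract_links(jsonData['lightningDeals'])`. If only the relative paths are needed, `df['productUrl'].tolist()` should give them directly.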

python

beautifulsoup

web-scraping

python-3.x

web-scraping-language