How to scrape text when it's not in an HTML element

Question

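I want to scrape this website: https://www.hectorjones.co.nz/milwaukee-hand-tools-and-accessories.html. I want to scrape the Product Sku, Price, and List Price elements. I managed to scrape Price, but I am having trouble with the other two, especially Product Sku, because it is not in a span. It is just inside a div. Can it be scraped? If so, could you help me?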

As you can see, the Product Sku value is not inside a span.

<div class="vm3pr-2"> <div class="product-price" id="productPrice1499">
<div class="product-sku"><span class="bold">Product SKU</span> : 2203-20<br></div>

Here is more of the markup.

<div class="vm3pr-2"> <div class="product-price" id="productPrice1499">
<div class="product-sku"><span class="bold">Product SKU</span> : 2203-20<br></div>
<div class="PricesalesPrice vm-display vm-price-value"><span class="vm-price-desc">Price (inc GST): 
</span><span class="PricesalesPrice">.00</span></div><span class="ex-tax"></span><div 
class="PricediscountAmount vm-nodisplay"><span class="vm-price-desc">Discount: </span><span 
class="PricediscountAmount"></span></div></div>

        <div class="clear"></div>
        </div>

Here is my code:

    prices = driver.find_elements_by_class_name("PricesalesPrice")
    sku = driver.find_elements_by_class_name("bold")
    list_price = driver.find_elements_by_class_name("PricebasePriceWithTax")

    for price in prices:
        print(price.text)

Answer 1

Here is how you can get each product SKU together with its corresponding price. I am using Beautiful Soup for the scraping.

views.py

import requests
from bs4 import BeautifulSoup
from django.shortcuts import render

base_url = 'https://www.hectorjones.co.nz/milwaukee-hand-tools-and-accessories.html'


def home(request):
    response = requests.get(base_url)
    data = response.text
    soup = BeautifulSoup(data, features='html.parser')
    post_listings = soup.find_all('div', {'class': 'product-price'})
    final_postings = []

    for post in post_listings:
        product_sku = post.find('div', {'class': 'product-sku'}).text
        price = post.find('span', {'class': 'PricesalesPrice'}).text
        final_postings.append((product_sku, price))
    context = {
        'final_postings': final_postings,
    }

    return render(request, 'display.html', context)
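
One caveat about the loop above: `find` returns `None` when a listing has no matching element, and reading `.text` from `None` raises `AttributeError`. A minimal sketch of a guard follows. The helper name `text_or_none` is hypothetical, and the `SimpleNamespace` object is only a stand-in for a found node so that the example runs on its own.

```python
from types import SimpleNamespace


def text_or_none(node):
    """Return the stripped text of a found node, or None if nothing was found."""
    return node.text.strip() if node is not None else None


# Stand-in for what post.find(...) returns when the element exists (assumption).
found = SimpleNamespace(text=" Product SKU : 2203-20 ")

print(text_or_none(found))  # Product SKU : 2203-20
print(text_or_none(None))   # None
```

In the view, this would wrap each `post.find(...)` call so that one incomplete listing does not abort the whole page.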

display.html

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>hectorjones.co.nz/</title>
</head>
<body>
{% for post in final_postings %}
    <ul>
         <li><p>{{ post.0 }}<br> Price : {{ post.1 }}
         </p></li>
    </ul>

{% endfor %}

</body>
</html>
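
To illustrate why the SKU is reachable even though it has no element of its own: it is a plain text node sitting directly inside the `product-sku` div, next to the `span` that holds the label. The sketch below uses only Python's standard-library `html.parser` rather than Beautiful Soup, and assumes the markup shown in the question. The class name `SkuParser` and the `VOID_TAGS` set are hypothetical.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not change the nesting depth.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class SkuParser(HTMLParser):
    """Collect text nodes that are direct children of <div class="product-sku">."""

    def __init__(self):
        super().__init__()
        self.depth = 0   # 0 = outside a product-sku div, 1 = directly inside it
        self.parts = []  # text nodes of the div currently being read
        self.skus = []   # one cleaned value per product-sku div

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1  # a nested element such as the label span
        elif tag == "div" and "product-sku" in (dict(attrs).get("class") or "").split():
            self.depth = 1
            self.parts = []

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.depth:
            return
        self.depth -= 1
        if self.depth == 0:
            # Drop surrounding whitespace and the leading colon separator.
            text = "".join(self.parts)
            self.skus.append(text.strip().lstrip(":").strip())

    def handle_data(self, data):
        if self.depth == 1:  # ignore text inside nested elements (the label)
            self.parts.append(data)


html = (
    '<div class="product-price" id="productPrice1499">'
    '<div class="product-sku"><span class="bold">Product SKU</span> : 2203-20<br></div>'
    '</div>'
)

parser = SkuParser()
parser.feed(html)
parser.close()
print(parser.skus)  # ['2203-20']
```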

Answer 2

Hi Burak, the code written earlier was for Django. As you asked, here is the code you need to run from cmd; it prints the list of products available on that website.

First, make sure these two Python packages are installed:

pip install requests

pip install bs4

scraping.py

import requests
from bs4 import BeautifulSoup

base_url = 'https://www.hectorjones.co.nz/milwaukee-hand-tools-and-accessories.html'


response = requests.get(base_url)
data = response.text
soup = BeautifulSoup(data, features='html.parser')
post_listings = soup.find_all('div', {'class': 'product-price'})
final_postings = []

for post in post_listings:
    product_sku = post.find('div', {'class': 'product-sku'}).text
    price = post.find('span', {'class': 'PricesalesPrice'}).text
    final_postings.append((product_sku, price))

print(final_postings)
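
Note that `.text` on the `product-sku` div returns the label and the value together, for example `Product SKU : 2203-20`. If only the value is wanted, it can be split off at the colon. This is a minimal sketch; the helper name `split_sku` is hypothetical, and it assumes the label and value are always separated by the first colon, as in the markup from the question.

```python
def split_sku(text):
    """Return the part after the first colon, or the whole text if there is none."""
    _label, sep, value = text.partition(":")
    # Without a colon there is no label to remove, so keep the text as it is.
    return value.strip() if sep else text.strip()


print(split_sku("Product SKU : 2203-20"))  # 2203-20
```

In the loop above, this would be applied as `split_sku(product_sku)` before appending to `final_postings`.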

If any step is unclear, let me know. Happy coding!

