Web Scraping by Using BeautifulSoup in Python

Question

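How can I use the json module to extract the price from the inline JSON data in a script tag?

I am trying to extract the price from https://glomark.lk/top-crust-bread/p/13676, but I cannot get the price value.

Please help me solve this.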

import requests
import json

import sys
sys.path.insert(0,'bs4.zip')
from bs4 import BeautifulSoup

user_agent = {
                 'User-agent': 'Mozilla/5.0 Chrome/35.0.1916.47'
                 }
headers = user_agent

url = 'https://glomark.lk/top-crust-bread/p/13676'
req = requests.get(url, headers = headers)
soup = BeautifulSoup(req.content, 'html.parser')

products = soup.find_all("div", class_ = "details col-12 col-sm-12 col-md-6 col-lg-5 col-xl-5")
for product in products:
    product_name = product.h1.text
    product_price = product.find(id = 'product-promotion-price').text
    print(product_name)
    print(product_price)

Answer 1

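You can get the JSON data (the price) from the site's hidden API using only the requests module. The product name, however, is not loaded dynamically, so it can be read from the page HTML.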

import requests
headers= {
    'content-type': 'application/json',
    'x-requested-with': 'XMLHttpRequest'
   }

api_url = "https://glomark.lk/product-page/variation-detail/13676"


jsonData = requests.post(api_url,  headers=headers).json()

price=jsonData['price']
print(price)

Output:

95

Full working code:

from bs4 import BeautifulSoup
import requests
headers= {
    'content-type': 'application/json',
    'x-requested-with': 'XMLHttpRequest'
   }

api_url = "https://glomark.lk/product-page/variation-detail/13676"


jsonData = requests.post(api_url,  headers=headers).json()

price=jsonData['price']



# Grab the product name from the static HTML (it is not loaded dynamically)

url = 'https://glomark.lk/top-crust-bread/p/13676'
req = requests.get(url)
soup = BeautifulSoup(req.content, 'html.parser')

title=soup.select_one('.product-title h1').text
print(title)
print(price)

Output:

Top Crust Bread
95
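
A slightly more defensive variant of the API request above, as a sketch only. It assumes the endpoint keeps returning a JSON object with a `price` key, as shown in this answer, and adds a timeout and a status check. The helper names `extract_price` and `fetch_price` are made up for illustration.

```python
API_URL = "https://glomark.lk/product-page/variation-detail/13676"
HEADERS = {
    "content-type": "application/json",
    "x-requested-with": "XMLHttpRequest",
}


def extract_price(payload):
    """Return the 'price' field of the decoded JSON, or None if it is missing."""
    if not isinstance(payload, dict):
        return None
    return payload.get("price")


def fetch_price(api_url=API_URL):
    """POST to the endpoint and return the price (hypothetical helper)."""
    # Imported here so extract_price can be used without requests installed.
    import requests

    response = requests.post(api_url, headers=HEADERS, timeout=10)
    response.raise_for_status()  # raise on 4xx/5xx instead of parsing an error page
    return extract_price(response.json())
```

Calling `print(fetch_price())` performs the same request as the code above; if the response has no `price` key, it prints `None` instead of raising a `KeyError`.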

Answer 2

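As mentioned, the content is served dynamically by JavaScript, so one approach is to take the data directly from the script tag with the inline JSON, which you already identified in your question.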

data = json.loads(soup.select_one('[type="application/ld+json"]').text)

This gives you a dict with the product information:

{'@context': 'https://schema.org', '@type': 'Product', 'productID': '13676', 'name': 'Top Crust Bread', 'description': 'Top Crust Bread', 'url': '/top-crust-bread/p/13676', 'image': 'https://objectstorage.ap-mumbai-1.oraclecloud.com/n/softlogicbicloud/b/cdn/o/products/350001--01--1555692328.jpeg', 'brand': 'GLOMARK', 'offers': [{'@type': 'Offer', 'price': '95', 'priceCurrency': 'LKR', 'itemCondition': 'https://schema.org/NewCondition', 'availability': 'https://schema.org/InStock'}]}

Then just pick the information you need, such as the price:

data['offers'][0]['price']
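
In schema.org markup, `offers` may be a single Offer object or a list of them, so indexing with `[0]` relies on the list form shown above. A minimal sketch that tolerates both forms follows; `get_price` is a hypothetical helper name, and `sample` is a trimmed copy of the dict from this answer.

```python
def get_price(data):
    """Return the first offer's price from a JSON-LD Product dict, or None."""
    offers = data.get("offers")
    if isinstance(offers, dict):  # single Offer object instead of a list
        offers = [offers]
    if not offers:  # missing key or empty list
        return None
    return offers[0].get("price")


sample = {
    "@type": "Product",
    "name": "Top Crust Bread",
    "offers": [{"@type": "Offer", "price": "95", "priceCurrency": "LKR"}],
}
print(get_price(sample))  # 95
```

Note that the price is a string in this markup, so convert it (for example with `float` or `decimal.Decimal`) before doing arithmetic.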

Example

import json

import requests
from bs4 import BeautifulSoup

url = 'https://glomark.lk/top-crust-bread/p/13676'
response = requests.get(url)
# Name the parser explicitly to avoid the "no parser was explicitly specified" warning
soup = BeautifulSoup(response.content, 'html.parser')

data = json.loads(soup.select_one('[type="application/ld+json"]').text)

product_price = data['offers'][0]['price']
product_name = data['name']
product_image = data['image']

print(product_name)
print(product_price)
print(product_image)

Output

Top Crust Bread 
95 
https://objectstorage.ap-mumbai-1.oraclecloud.com/n/softlogicbicloud/b/cdn/o/products/350001--01--1555692328.jpeg
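
`select_one` returns only the first matching script tag, which is enough here, but a page can carry several `application/ld+json` blocks. The sketch below picks the one whose `@type` is `Product`. It works on plain strings so that it runs without a network call; `find_product` and the sample strings are illustrative and not taken from the site.

```python
import json


def find_product(blocks):
    """Return the first JSON-LD dict with '@type' == 'Product', or None."""
    for raw in blocks:
        try:
            data = json.loads(raw)
        except (TypeError, ValueError):  # None from an empty tag, or malformed JSON
            continue
        if isinstance(data, dict) and data.get("@type") == "Product":
            return data
    return None


blocks = [
    '{"@type": "Organization", "name": "Example Shop"}',
    '{"@type": "Product", "name": "Top Crust Bread", "offers": [{"price": "95"}]}',
]
product = find_product(blocks)
print(product["name"])  # Top Crust Bread
```

With BeautifulSoup, the input list could be built as `[tag.string for tag in soup.select('script[type="application/ld+json"]')]`.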
