Web Scraping by Using BeautifulSoup in Python
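How can I use the json module to extract the price from the inline JSON data in a script tag?
I am trying to extract the price from https://glomark.lk/top-crust-bread/p/13676, but I cannot get the price value.
Please help me solve this.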
import requests
import json
import sys
sys.path.insert(0,'bs4.zip')
from bs4 import BeautifulSoup
user_agent = {
'User-agent': 'Mozilla/5.0 Chrome/35.0.1916.47'
}
headers = user_agent
url = 'https://glomark.lk/top-crust-bread/p/13676'
req = requests.get(url, headers = headers)
soup = BeautifulSoup(req.content, 'html.parser')
products = soup.find_all("div", class_="details col-12 col-sm-12 col-md-6 col-lg-5 col-xl-5")
for product in products:
    product_name = product.h1.text
    product_price = product.find(id='product-promotion-price').text
    print(product_name)
    print(product_price)
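You can get the JSON data (the price) from a hidden API using only the requests module. The product name, however, is not dynamic, so it can be read from the page HTML.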
import requests
headers= {
'content-type': 'application/json',
'x-requested-with': 'XMLHttpRequest'
}
api_url = "https://glomark.lk/product-page/variation-detail/13676"
jsonData = requests.post(api_url, headers=headers).json()
price=jsonData['price']
print(price)
Output:
95
Complete working code:
from bs4 import BeautifulSoup
import requests
headers= {
'content-type': 'application/json',
'x-requested-with': 'XMLHttpRequest'
}
api_url = "https://glomark.lk/product-page/variation-detail/13676"
jsonData = requests.post(api_url, headers=headers).json()
price=jsonData['price']
# Grab the product name (not dynamic) from the page HTML
url = 'https://glomark.lk/top-crust-bread/p/13676'
req = requests.get(url)
soup = BeautifulSoup(req.content, 'html.parser')
title=soup.select_one('.product-title h1').text
print(title)
print(price)
Output:
Top Crust Bread
95
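The API call above can be wrapped with a timeout and a status check, so that a blocked request or a changed response fails loudly rather than raising an obscure error later. This is only a minimal sketch: the function name `fetch_price` and its `post` parameter are hypothetical (the parameter exists so the function can be exercised without network access), and the response is assumed to keep the `price` key shown above.

```python
def fetch_price(product_id, post=None, timeout=10):
    """Return the 'price' field from the variation-detail endpoint.

    `post` is a hypothetical injection point: any callable with the same
    signature as requests.post. It defaults to requests.post.
    """
    if post is None:
        import requests  # imported lazily so the function can be tested offline
        post = requests.post

    api_url = f"https://glomark.lk/product-page/variation-detail/{product_id}"
    headers = {
        "content-type": "application/json",
        "x-requested-with": "XMLHttpRequest",
    }
    response = post(api_url, headers=headers, timeout=timeout)
    response.raise_for_status()  # raise on 4xx/5xx instead of parsing an error page
    payload = response.json()
    if "price" not in payload:
        # Assumption: the endpoint returns a flat object with a 'price' key
        raise KeyError(f"no 'price' key in response for product {product_id}")
    return payload["price"]
```

Calling `fetch_price(13676)` performs the same request as the snippet above. Since the endpoint is undocumented, its URL and response shape may change without notice.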
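As mentioned, the content is provided dynamically by JavaScript, so one approach is to take the data directly from the script tag, which you already identified in your question.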
data = json.loads(soup.select_one('[type="application/ld+json"]').text)
gives you a dict with the product information:
{'@context': 'https://schema.org', '@type': 'Product', 'productID': '13676', 'name': 'Top Crust Bread', 'description': 'Top Crust Bread', 'url': '/top-crust-bread/p/13676', 'image': 'https://objectstorage.ap-mumbai-1.oraclecloud.com/n/softlogicbicloud/b/cdn/o/products/350001--01--1555692328.jpeg', 'brand': 'GLOMARK', 'offers': [{'@type': 'Offer', 'price': '95', 'priceCurrency': 'LKR', 'itemCondition': 'https://schema.org/NewCondition', 'availability': 'https://schema.org/InStock'}]}
Then just pick out the information you need, such as the price:
data['offers'][0]['price']
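Indexing `data['offers'][0]` works for the dict shown above, where `offers` is a list. In JSON-LD a property can also hold a single object instead of a list, so a slightly more defensive lookup may be worth having. The helper below is a sketch under that assumption; `extract_price` is a hypothetical name, and the sample dict is trimmed from the output above.

```python
from decimal import Decimal

def extract_price(data):
    """Return the first offer's price as a Decimal, or None if absent."""
    offers = data.get("offers")
    if isinstance(offers, dict):
        # A single Offer object rather than a list of offers
        offers = [offers]
    if not offers:
        return None
    price = offers[0].get("price")
    # Prices arrive as strings ('95'); Decimal avoids float rounding
    return Decimal(str(price)) if price is not None else None

sample = {
    "name": "Top Crust Bread",
    "offers": [{"@type": "Offer", "price": "95", "priceCurrency": "LKR"}],
}
print(extract_price(sample))  # 95
```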
Example
import json
import requests
from bs4 import BeautifulSoup
url = 'https://glomark.lk/top-crust-bread/p/13676'
response = requests.get(url)
# Name the parser explicitly to avoid BeautifulSoup's "no parser specified" warning
soup = BeautifulSoup(response.content, 'html.parser')
data = json.loads(soup.select_one('[type="application/ld+json"]').text)
product_price = data['offers'][0]['price']
product_name = data['name']
product_image = data['image']
print(product_name)
print(product_price)
print(product_image)
Output
Top Crust Bread
95
https://objectstorage.ap-mumbai-1.oraclecloud.com/n/softlogicbicloud/b/cdn/o/products/350001--01--1555692328.jpeg
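Note that `select_one` returns only the first matching script tag. If a page carries several JSON-LD blocks, it may be safer to collect them all and pick the one whose `@type` is `Product`. The sketch below shows the idea offline, using the standard library's `html.parser` in place of BeautifulSoup so it runs without a network request; with BeautifulSoup the equivalent would be looping over `soup.select('[type="application/ld+json"]')`. The names `JsonLdCollector` and `find_product` and the sample HTML (including the extra Organization block) are made up for illustration.

```python
import json
from html.parser import HTMLParser

class JsonLdCollector(HTMLParser):
    """Collect the parsed contents of every application/ld+json script tag."""

    def __init__(self):
        super().__init__()
        self._in_ld = False
        self._buf = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_ld = True
            self._buf = []

    def handle_data(self, data):
        if self._in_ld:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_ld:
            self._in_ld = False
            try:
                self.blocks.append(json.loads("".join(self._buf)))
            except json.JSONDecodeError:
                pass  # skip malformed blocks instead of aborting

def find_product(html):
    """Return the first JSON-LD block with @type 'Product', or None."""
    collector = JsonLdCollector()
    collector.feed(html)
    collector.close()
    for block in collector.blocks:
        if isinstance(block, dict) and block.get("@type") == "Product":
            return block
    return None

# Hypothetical page fragment: two JSON-LD blocks, only one is the product
SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">{"@type": "Organization", "name": "GLOMARK"}</script>
<script type="application/ld+json">
{"@type": "Product", "name": "Top Crust Bread",
 "offers": [{"@type": "Offer", "price": "95", "priceCurrency": "LKR"}]}
</script>
</head><body></body></html>
"""

product = find_product(SAMPLE_HTML)
print(product["name"], product["offers"][0]["price"])
```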