Difficulty Scraping Product Information from Website
I am unable to scrape the 'Product Name' and 'Price' from this website: https://www.fairprice.com.sg/product/zespri-new-zealand-kiwifruit-green-4-6-per-pack-13045571
I am hoping to scrape "$4.30" and "Zespri New Zealand Kiwifruit - Green" from the page. I have tried various approaches (Beautiful Soup, requests_html, Selenium) without success. Sample code for each approach is attached below.
I can see the 'price' and 'product name' details in Chrome's Developer Tools. The page appears to load the product information dynamically via JavaScript, which is why none of the approaches above manage to scrape it correctly.
Any help with this problem would be appreciated.
Requests_html approach:
from requests_html import HTMLSession
import json
url='https://www.fairprice.com.sg/product/zespri-new-zealand-kiwifruit-green-4-6-per-pack-13045571'
session = HTMLSession()
r = session.get(url)
r.html.render(timeout=20)
json_text=r.html.xpath("//script[@type='application/ld+json']/text()")[0][:-1]
json_data = json.loads(json_text)
print(json_data['name']['price'])
Beautiful Soup approach:
import sys
import time
from bs4 import BeautifulSoup
import requests
import re
url='https://www.fairprice.com.sg/product/zespri-new-zealand-kiwifruit-green-4-6-per-pack-13045571'
headers={'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36 Edg/97.0.1072.69'}
page=requests.get(url, headers=headers)
time.sleep(2)
soup=BeautifulSoup(page.text,'html.parser')
linkitem=soup.find_all('span',attrs={'class':'sc-1bsd7ul-1 djlKtC'})
print(linkitem)
linkprice=soup.find_all('span',attrs={'class':'sc-1bsd7ul-1 sc-13n2dsm-5 kxEbZl deQJPo'})
print(linkprice)
Selenium approach:
import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
url = "https://www.fairprice.com.sg/product/zespri-new-zealand-kiwifruit-green-4-6-per-pack-13045571"
options = Options()
options.add_argument('--headless')
options.add_argument('--disable-gpu')
driver = webdriver.Chrome(options=options)
driver.get(url)
time.sleep(3)
page = driver.page_source
driver.quit()
soup = BeautifulSoup(page, 'html.parser')
linkitem = soup.find_all('span',attrs={'class':'sc-1bsd7ul-1 djlKtC'})
print(linkitem)
Your approach using the embedded JSON needs only a small fix; in other words, you're almost there (the corrected lookup is shown after the output below). Also, this can be done with plain requests and bs4.
PS. I'm using different URLs, since the one you gave returns a 404.
Here's how:
import json
import requests
from bs4 import BeautifulSoup
urls = [
    "https://www.fairprice.com.sg/product/11798142",
    "https://www.fairprice.com.sg/product/vina-maipo-cabernet-sauvignon-merlot-750ml-11690254",
    "https://www.fairprice.com.sg/product/new-moon-new-zealand-abalone-425g-75342",
]
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:96.0) Gecko/20100101 Firefox/96.0",
}
for url in urls:
    # the product data is embedded in the page as JSON-LD, so it can be parsed straight out of the static HTML
    product_data = json.loads(
        BeautifulSoup(requests.get(url, headers=headers).text, "lxml")
        .find("script", type="application/ld+json")
        .string[:-1]  # drop the last character so json.loads accepts the script contents
    )
    print(product_data["name"])
    print(product_data["offers"]["price"])
This should output:
Nongshim Instant Cup Noodle - Spicy
1.35
Vina Maipo Red Wine - Cabernet Sauvignon Merlot
14.95
New Moon New Zealand Abalone
33.8
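For reference, your original requests_html snippet should work as well once the final lookup is corrected. This is just a sketch under the same assumptions as above (the JSON-LD exposes "name" and "offers" keys, and a trailing character has to be stripped); I've also swapped in one of the URLs from this answer, since the one in the question returns a 404 for me:
from requests_html import HTMLSession
import json
url = 'https://www.fairprice.com.sg/product/11798142'
session = HTMLSession()
r = session.get(url)
r.html.render(timeout=20)  # optional here: the JSON-LD already ships with the static HTML
json_text = r.html.xpath("//script[@type='application/ld+json']/text()")[0][:-1]  # keep the [:-1] so json.loads accepts it
json_data = json.loads(json_text)
print(json_data['name'])               # e.g. Nongshim Instant Cup Noodle - Spicy
print(json_data['offers']['price'])    # e.g. 1.35 -- the price sits under "offers", not under "name"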