BeautifulSoup scrape - failing to retrieve the product list
I'm reaching out because I'm having some trouble tweaking a piece of code that is supposed to scrape some information from Amazon product pages (title, URL, product name, etc.). Classic scraping-practice material :)
I basically split it into separate functions:
- one function that generates the URLs to scrape
- one function that navigates between the different elements and extracts their values
At the end I simply run my driver and BeautifulSoup and launch those two functions.
However, the result is not what I expected. I want to end up with a tidy CSV file, one row per product, with each relevant piece of information in its own column. Instead, I always end up with only 1 or 2 rows, not all the products from all the pages.
I suspect it comes from my soup and the "for loop", which isn't iterating over all the items properly (though I can't figure out what exactly is wrong).
I'd like to hear your thoughts on this. Any clues?
Thanks a lot for your help.
from bs4 import BeautifulSoup
from selenium import webdriver
import csv

#Function to generate URL with search KW & page nb
def get_url(search_term, page):
    template = 'https://www.amazon.co.uk/s?k={}&page=' + str(page)
    search_term = search_term.replace(' ', '+')
    url = template.format(search_term)
    return url

#Function to retrieve all data from the page
def extract_record(item):
    atag = item.h2.a
    #Retrieve product name
    description = atag.text.strip()
    #Retrieve product URL
    url = 'https://www.amazon.co.uk' + atag.get('href')
    #Retrieve sponsored status
    try:
        sponso_parent = item.find('span', 's-label-popover-default')
        sponso = sponso_parent.find('span', {'class': 'a-size-mini a-color-secondary', 'dir': 'auto'}).text
    except AttributeError:
        sponso = 'No'
    #Retrieve price info
    try:
        price_parent = item.find('span', 'a-price')
        price = price_parent.find('span', 'a-offscreen').text
    except AttributeError:
        return
    #Retrieve avg product rating
    try:
        rating = item.i.text
    except AttributeError:
        rating = ''
    #Retrieve review count (if monetary value, nill it due to missing value)
    try:
        review_count = item.find('span', {'class': 'a-size-base', 'dir': 'auto'}).text
    except AttributeError:
        review_count = ''
    if "£" in review_count or "€" in review_count or "$" in review_count:
        review_count = 0
    result = (url, description, sponso, price, rating, review_count)
    return result

record_final = []
#Loop through page nb
for page in range(1, 3):
    url = get_url('laptop', page)
    print(url)
    #Instantiate web driver & retrieve page content with BS (then loop through every product)
    driver = webdriver.Chrome(r'\Users\rapha\Desktop\chromedriver.exe')
    driver.get(url)
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    final_soup = soup.find_all('div', {'data-component-type': 's-search-result'})
    try:
        for item in final_soup:
            record = extract_record(item)
            if record:
                record_final.append(record)
    except AttributeError:
        print('error_record')
    driver.close()

with open('resultsamz.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['url', 'description', 'sponso', 'price', 'rating', 'review_count'])
    writer.writerow(record_final)
You have to iterate over record_final to save each record on its own row.
Change this:
writer.writerow(record_final)
to this:
for item in record_final:
    writer.writerow(item)
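As an aside, csv.writer also provides writerows, which writes each element of an iterable as its own row in a single call, so the loop can be replaced with:
writer.writerows(record_final)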
Your code is doing exactly what you told it to do.
# Retrieve review count (if monetary value, nill it due to missing value)
This is what you get:
('https://www.amazon.co.uk/G-Anica%C2%AE-Portable-Ultrabook-Earphone-Accessories/dp/B08FCFDPVF/ref=sr_1_10?dchild=1&keywords=laptop&qid=1606453924&sr=8-10', 'G-Anica® Netbook Laptop PC 10 inch Android Portable Ultrabook,Dual Core, Wifi,with Laptop Bag + Mouse + Mouse Pad + Earphone (4 PCS Computer Accessories) (Pink)', 'No', '£119.99', '3.4 out of 5 stars', '21')
('https://www.amazon.co.uk/CHERRY%C2%AE-Notebook-Netbook-Computer-Keyboard/dp/B07ZPW7R14/ref=sr_1_11?dchild=1&keywords=laptop&qid=1606453924&sr=8-11', 'FANCY CHERRY® NEW 2018 HD 10 inch Mini Laptop Notebook Netbook Tablet Computer 1G DDR3 8GB Memory VIA WM8880 CPU Dual Core Android Screen Wifi Camera Keyboard USB HDMI (Black 8GB)', 'No', '£109.99', '3.3 out of 5 stars', '111')
None
None
None
None
None
https://www.amazon.co.uk/s?k=laptop&page=2
Now, if you visit that page, there are plenty of laptops listed without a price. Your code is skipping exactly the items you told it to skip.
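If you'd rather keep those products instead of dropping them, a minimal tweak to the price block in extract_record (assuming an empty price column is acceptable for unpriced items) would be:
    #Retrieve price info (keep the row even when no price is displayed)
    try:
        price_parent = item.find('span', 'a-price')
        price = price_parent.find('span', 'a-offscreen').text
    except AttributeError:
        price = ''  #blank price instead of returning None, so the product still gets a row
With that change, products without a visible price still end up in record_final, just with an empty price column.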