BeautifulSoup scrape - Failing to retrieve product list

I'm reaching out because I'm having some trouble tweaking a piece of code that is supposed to scrape some information from Amazon product pages (title, URL, product name, etc.). Classic scraping-practice material :)

So I basically split the work across different functions:

  1. One function to generate the URLs to scrape
  2. One function to navigate through the different elements and extract the values

Finally I just run my driver & BeautifulSoup and launch those two functions.

However, the result is not what I expected. I want to end up with an organised CSV file, with one row per product and each relevant piece of information in its own column. Yet I always end up with 1 or 2 rows, not all the products from all the pages.

I think this comes from my soup and the "for loop", which doesn't iterate correctly over all the items (although I can't figure out exactly what is wrong).

I'd like to hear your thoughts on this - do you have any clue?

Thanks a lot for your help

from bs4 import BeautifulSoup
from selenium import webdriver
import csv

#Function to generate URL with search KW & page nb
def get_url(search_term,page):
    template = 'https://www.amazon.co.uk/s?k={}&page={}'
    search_term = search_term.replace(' ','+')
    url = template.format(search_term, page)

    return url

#Function to retrieve all data from the page
def extract_record(item):
    atag = item.h2.a
    
    #Retrieve product name
    description = atag.text.strip()
    
    #Retrieve product URL
    url = 'https://www.amazon.co.uk' + atag.get('href')
    
    #Retrieve sponsored status
    try:
        sponso_parent = item.find('span','s-label-popover-default')
        sponso = sponso_parent.find('span', {'class': 'a-size-mini a-color-secondary', 'dir': 'auto'}).text
    except AttributeError:
        sponso = 'No' 

    #Retrieve price info
    try:
        price_parent = item.find('span','a-price')
        price = price_parent.find('span','a-offscreen').text
    except AttributeError:
        return
    
    #Retrieve avg product rating
    try:
        rating = item.i.text
    except AttributeError:
        rating = ''
    
    #Retrieve review count (if monetary value, nill it due to missing value)
    try:
        review_count = item.find('span', {'class': 'a-size-base', 'dir': 'auto'}).text
    except AttributeError:
        review_count = ''
    
    if "£" in review_count or "€" in review_count or "$" in review_count:
        review_count = 0
    
    result = (url, description, sponso, price, rating, review_count)
    
    return result
        
record_final = []

#Loop through page nb
for page in range(1,3):
    url = get_url('laptop',page)
    print(url)
    
    #Instantiate web driver & retrieve page content with BS (then loop through every product)
    driver = webdriver.Chrome(r'\Users\rapha\Desktop\chromedriver.exe')
    driver.get(url)
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    final_soup = soup.find_all('div',{'data-component-type': 's-search-result'})
    
    try:
        for item in final_soup:
            record = extract_record(item)
            if record:
                record_final.append(record)
    except AttributeError:
        print('error_record')
    
    driver.close()

with open('resultsamz.csv','w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['url', 'description', 'sponso', 'price', 'rating','review_count'])
    writer.writerow(record_final)


You have to loop over record_final to save each item on its own row.

Change this:

writer.writerow(record_final)

To this:

for item in record_final:
    writer.writerow(item)
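Equivalently, `csv.writer` has a `writerows` method that takes the whole iterable of rows in one call. A minimal sketch with made-up records in the same (url, description, sponso, price, rating, review_count) shape:

```python
import csv

# Hypothetical records, stand-ins for what extract_record returns
record_final = [
    ('https://www.amazon.co.uk/dp/AAA', 'Laptop A', 'No', '£199.99', '4.5 out of 5 stars', '120'),
    ('https://www.amazon.co.uk/dp/BBB', 'Laptop B', 'Yes', '£299.99', '4.0 out of 5 stars', '85'),
]

with open('resultsamz.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['url', 'description', 'sponso', 'price', 'rating', 'review_count'])
    writer.writerows(record_final)  # one CSV row per tuple
```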

Your code is doing exactly what you told it to do.

#Retrieve price info
try:
    price_parent = item.find('span','a-price')
    price = price_parent.find('span','a-offscreen').text
except AttributeError:
    return

This is what you get:

('https://www.amazon.co.uk/G-Anica%C2%AE-Portable-Ultrabook-Earphone-Accessories/dp/B08FCFDPVF/ref=sr_1_10?dchild=1&keywords=laptop&qid=1606453924&sr=8-10', 'G-Anica® Netbook Laptop PC 10 inch Android Portable Ultrabook,Dual Core, Wifi,with Laptop Bag + Mouse + Mouse Pad + Earphone (4 PCS Computer Accessories) (Pink)', 'No', '£119.99', '3.4 out of 5 stars', '21')
('https://www.amazon.co.uk/CHERRY%C2%AE-Notebook-Netbook-Computer-Keyboard/dp/B07ZPW7R14/ref=sr_1_11?dchild=1&keywords=laptop&qid=1606453924&sr=8-11', 'FANCY CHERRY® NEW 2018 HD 10 inch Mini Laptop Notebook Netbook Tablet Computer 1G DDR3 8GB Memory VIA WM8880 CPU Dual Core Android Screen Wifi Camera Keyboard USB HDMI (Black 8GB)', 'No', '£109.99', '3.3 out of 5 stars', '111')
None
None
None
None
None
https://www.amazon.co.uk/s?k=laptop&page=2

Now, if you visit that page, there are plenty of laptops without a price. Your code is skipping exactly the ones you told it to skip.
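If you want those unpriced products to appear in the CSV anyway, one option is to fall back to an empty string instead of returning early. A minimal sketch of that pattern - `FakeItem` is a made-up stand-in for a BeautifulSoup tag, only there to show the control flow:

```python
def extract_price(item):
    # Same lookup as in extract_record, but a missing price no longer
    # drops the whole record - it just leaves the price column blank.
    try:
        return item.find('span', 'a-price').find('span', 'a-offscreen').text
    except AttributeError:
        return ''

class FakeItem:
    """Hypothetical stand-in for a search-result tag with no price span."""
    def find(self, *args, **kwargs):
        return None  # BeautifulSoup's find() returns None when nothing matches

print(extract_price(FakeItem()))  # blank price, but the product keeps its row
```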