Scraper does not go to next pages in Python

I am trying to scrape the site https://lt.brcauto.eu/ and need to collect at least 50 cars from it. From the main page I go to the "car search" page and start scraping from the first one. However, there are only 21 cars per page, so when the cars on a page run out and the parser should move on to the next page, I get a list index out of range error. This is how I am trying to scrape:

import json
import requests
from bs4 import BeautifulSoup

mainURL = 'https://lt.brcauto.eu/'

req1 = requests.get(mainURL)
soup1 = BeautifulSoup(req1.text, 'lxml')

link = soup1.find('div', class_ = 'home-nav flex flex-wrap')
temp = link.findAll("a") # find search link
URL = (temp[1].get('href') + '/')

req2 = requests.get(URL)
soup2 = BeautifulSoup(req2.text, 'lxml')

page = soup2.find_all('li', class_ = 'page-item')[-2] # search pages till max ">"

cars_printed_counter = 0

for number in range(1, int(page.text)): #from 1 until max page
  req2 = requests.get(URL + '?page=' + str(number)) #page url
  soup2 = BeautifulSoup(req2.text, 'lxml')

  if cars_printed_counter == 50:
      break # due faster execution

out = [] # holding all cars

for single_car in soup2.find_all('div', class_ = 'cars-wrapper'):

    if cars_printed_counter == 50:
        break # after 5 cars

    Car_Title = single_car.find('h2', class_ = 'cars__title')
    Car_Specs = single_car.find('p', class_ = 'cars__subtitle')


    #print('\nCar number:', cars_printed_counter + 1)
    #print(Car_Title.text)
    #print(Car_Specs.text)
    
    car = {}
    spl = Car_Specs.text.split(' | ')
    car["fuel"] = spl [1].split(" ")[1]
    car["Title"] = str(Car_Title.text)
    car["Year"] = int(spl [0])
    car["run"] = int(spl [3].split(" ")[0])
    car["type"] = spl [5]
    car["number"] = cars_printed_counter + 1
    out.append(car)
    cars_printed_counter += 1

print(json.dumps(out))
with open("outfile.json", "w") as f:
    f.write(json.dumps(out))

I noticed that if I only print the cars like this:

for single_car in soup.find_all('div', class_ = 'cars-wrapper'):

    if cars_printed_counter == 50:
        break

    Car_Title = single_car.find('h2', class_ = 'cars__title')
    Car_Specs = single_car.find('p', class_ = 'cars__subtitle')
    Car_Price = single_car.find('div', class_ = 'w-full lg:w-auto cars-price text-right pt-1')

    print('\nCar number:', cars_printed_counter + 1)

    print(Car_Title.text)
    print(Car_Specs.text)
    print(Car_Price.text)

    cars_printed_counter += 1

everything works fine. But as soon as I try to write them out as JSON like this:

car = {}
    spl = Car_Specs.text.split(' | ')
    car["fuel"] = spl [1].split(" ")[1]
    car["Title"] = str(Car_Title.text)
    car["Year"] = int(spl [0])
    car["run"] = int(spl [3].split(" ")[0])
    car["type"] = spl [5]
    car["number"] = cars_printed_counter + 1
    out.append(car)

    cars_printed_counter += 1

print(json.dumps(out))
with open("outfile.json", "w") as f:
    f.write(json.dumps(out))

I get the list index out of range error.

P.S. Or should I be using multithreading here?

First of all - set the multithreading idea aside for now. Your code has other issues:

  • As already mentioned, check the indentation in your question's code - as it stands it makes little sense, because you loop over all the pages but only scrape the last one.

  • The issue that causes IndexError: list index out of range

Print your spl and you will see the problem - this car does not run on an internal combustion engine:

['2013', 'Elektra', 'Automatinė', '108030 km', '310 kW (422 AG)', 'Mėlyna']

Trying to select by index the way you do - car["fuel"] = spl [1].split(" ")[1] - leads to the error; instead, select the last element in the list, like this:

car["fuel"] = spl [1].split(" ")[-1]
Example

Your indentation should look more like this, so that you iterate over all pages and collect the car info in out, which lives outside of all the loops:

...
cars_printed_counter = 0

out = [] # holding all cars

for number in range(1, int(page.text)): #from 1 until max page
    req2 = requests.get(URL + '?page=' + str(number)) #page url
    soup2 = BeautifulSoup(req2.text, 'lxml')

    if cars_printed_counter == 50:
        break # stop early for faster execution

    for single_car in soup2.find_all('div', class_ = 'cars-wrapper'):

        if cars_printed_counter == 50:
            break # stop after 50 cars

        Car_Title = single_car.find('h2', class_ = 'cars__title')
        Car_Specs = single_car.find('p', class_ = 'cars__subtitle')

        car = {}
        spl = Car_Specs.text.split(' | ')
        print(spl)
        car["fuel"] = spl [1].split(" ")[-1]
        car["Title"] = str(Car_Title.text)
        car["Year"] = int(spl [0])
        car["run"] = int(spl [3].split(" ")[0])
        car["type"] = spl [5]
        car["number"] = cars_printed_counter + 1
        out.append(car)
        cars_printed_counter += 1

# print(json.dumps(out))
with open("outfile.json", "w") as f:
    f.write(json.dumps(out))
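
One small optional refinement (not required for the fix): if you want the Lithuanian characters such as 'Mėlyna' or 'Automatinė' to stay readable in the output file instead of being written as \u escapes, you can pass the file handle straight to json.dump:

with open("outfile.json", "w", encoding="utf-8") as f:
    json.dump(out, f, ensure_ascii=False, indent=2)  # keep non-ASCII characters readable and pretty-print the list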

This solution worked for me:

        car = {}
        spl = Car_Specs.text.split(' | ')
        if spl[1].split(" ")[0] == 'Elektra': # break on Electric cars
            break
        car["fuel"] = spl [1].split(" ")[1]
        car["Title"] = str(Car_Title.text)
        car["Year"] = int(spl [0])
        car["run"] = int(spl [3].split(" ")[0])
        car["type"] = spl [5]
        car["number"] = cars_printed_counter + 1
        out.append(car)
        cars_printed_counter += 1

    print(json.dumps(out))
    with open("outfile.json", "w") as f:
        f.write(json.dumps(out))

So I added:

if spl[1].split(" ")[0] == 'Elektra':
    break

because the second element being scraped is the fuel type, which also contains the engine volume in litres. When the scraper hits an electric car the dict cannot be built, because an electric car has no litres - its [0] is just the fuel type.
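
If you later want to keep the electric cars instead of skipping them, one possible variant of that block (just a sketch, building on the [-1] idea from the answer and assuming the spec layout shown above) would be:

spl = Car_Specs.text.split(' | ')
fuel_parts = spl[1].split(' ')
if fuel_parts[0] == 'Elektra':
    car["fuel"] = 'Elektra'        # EVs list only the fuel type, no engine volume
else:
    car["fuel"] = fuel_parts[1]    # e.g. '2.0 Dyzelinas' -> 'Dyzelinas' (assumed format)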