How do I scrape data that is on multiple pages without it failing?

I'm really new to web scraping and I'm running into trouble when scraping multiple pages. I'm trying to get the titles of the episodes along with their ratings.

I only managed to scrape the first page successfully; after that it stops working.

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait

url = 'https://www.imdb.com/title/tt0386676/episodes?season=1'

next_season = "//*[@id='load_next_episodes']"

browser = webdriver.Chrome()
browser.get(url)

for season in range(1,10):
    i = 1
    episodes = browser.find_elements_by_class_name('info')
    for episode in episodes:
        title = episode.find_element_by_xpath(f'//*[@id="episodes_content"]/div[2]/div[2]/div[{i}]/div[2]/strong/a').text
        rating = episode.find_element_by_class_name('ipl-rating-star__rating').text
        print(title, rating)
        i += 1

    browser.find_element_by_xpath(next_season).click()
browser.close()

My output looks like this:

Pilot 7.4
Diversity Day 8.2
Health Care 7.7
The Alliance 7.9
Basketball 8.3
Hot Girl 7.6

You can also get the page details without clicking the season button. First get all the season numbers from the dropdown box, then iterate over them. You can create lists and append the data to them, then loop over them at the end, or load the data into a dataframe and export it to a CSV file.

Code:

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import Select

driver = webdriver.Chrome()
driver.get("https://www.imdb.com/title/tt0386676/episodes?season=1")
wait=WebDriverWait(driver,10)
selectSeason=wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, '#bySeason')))
select=Select(selectSeason)
allSeasons=[option.get_attribute('value') for option in select.options] #get all season numbers
print(allSeasons)
title=[]
ratings=[]
for season in allSeasons:
    url="https://www.imdb.com/title/tt0386676/episodes?season={}".format(season)
    print(url)
    driver.get(url)
    for e in wait.until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, ".info"))):
        title.append(e.find_element(By.CSS_SELECTOR, "a[itemprop='name']").text)
        ratings.append(e.find_element(By.CSS_SELECTOR, ".ipl-rating-star.small .ipl-rating-star__rating").text)
    
for t, r in zip(title, ratings):
    print(t + " --- " + r)

driver.quit()
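If you prefer the dataframe/CSV route mentioned above, here is a minimal sketch of that last step, assuming the `title` and `ratings` lists were already filled by the scraping loop (the sample values below are placeholders, not real scraped data):

```python
import pandas as pd

# Placeholder data standing in for the lists filled by the Selenium loop.
title = ["Pilot", "Diversity Day"]
ratings = ["7.4", "8.2"]

# Build a dataframe from the two parallel lists.
df = pd.DataFrame({"title": title, "rating": ratings})

# The ratings are scraped as text, so convert them to numbers.
df["rating"] = df["rating"].astype(float)

# Export to CSV without the index column.
df.to_csv("episodes.csv", index=False)
print(df)
```

This way you also get sorting and filtering for free, e.g. `df.sort_values("rating", ascending=False)` to rank episodes by rating.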

Output: