How do I scrape data that is on multiple pages without it failing?

Question

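I'm really new to scraping data and I'm having trouble scraping multiple pages. I'm trying to get each episode's title along with its rating.

I only managed to scrape the first page; after that it stops working.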

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait

url = 'https://www.imdb.com/title/tt0386676/episodes?season=1'

next_season = "//*[@id='load_next_episodes']"

browser = webdriver.Chrome()
browser.get(url)

for season in range(1,10):
    i = 1
    episodes = browser.find_elements_by_class_name('info')
    for episode in episodes:
        title = episode.find_element_by_xpath(f'//*[@id="episodes_content"]/div[2]/div[2]/div[{i}]/div[2]/strong/a').text
        rating = episode.find_element_by_class_name('ipl-rating-star__rating').text
        print(title, rating)
        i += 1

    browser.find_element_by_xpath(next_season).click()
browser.close()

My output looks like this:

Pilot 7.4
Diversity Day 8.2
Health Care 7.7
The Alliance 7.9
Basketball 8.3
Hot Girl 7.6

Answer 1

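You can also get the page details without clicking the season button. First read all the season numbers from the dropdown box, then iterate over them and load each season's page directly. You can create lists and append the data to them, then either iterate over the lists at the end or load them into a dataframe and export it to a CSV file.

Code: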

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import Select

driver = webdriver.Chrome()
driver.get("https://www.imdb.com/title/tt0386676/episodes?season=1")
wait = WebDriverWait(driver, 10)

# Read every season number from the season dropdown
selectSeason = wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, '#bySeason')))
select = Select(selectSeason)
allSeasons = [option.get_attribute('value') for option in select.options]
print(allSeasons)

title = []
ratings = []
for season in allSeasons:
    # Load each season page directly instead of clicking the next-season button
    url = "https://www.imdb.com/title/tt0386676/episodes?season={}".format(season)
    print(url)
    driver.get(url)
    # Wait until the episode blocks are visible before reading them
    for e in wait.until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, ".info"))):
        title.append(e.find_element(By.CSS_SELECTOR, "a[itemprop='name']").text)
        ratings.append(e.find_element(By.CSS_SELECTOR, ".ipl-rating-star.small .ipl-rating-star__rating").text)

# Close the browser once all seasons have been read
driver.quit()

for t, r in zip(title, ratings):
    print(t + " --- " + r)
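
The CSV export mentioned above could look roughly like the sketch below. It uses Python's standard csv module rather than a dataframe, so it needs no extra packages. The helper name episodes_to_csv and the column headers are assumptions made for illustration, and the sample values are taken from the output shown in the question; in practice you would pass in the two lists collected in the loop.

```python
import csv
import io


def episodes_to_csv(titles, ratings, out):
    """Write a header row, then one (title, rating) row per episode, to `out`.

    `out` is any writable text file-like object. This helper is hypothetical,
    not part of Selenium or of the original answer.
    """
    writer = csv.writer(out)
    writer.writerow(["title", "rating"])
    for t, r in zip(titles, ratings):
        writer.writerow([t, r])


# Sample values copied from the output in the question
sample_titles = ["Pilot", "Diversity Day", "Health Care"]
sample_ratings = ["7.4", "8.2", "7.7"]

# Write to an in-memory buffer so the sketch runs without touching the disk
buffer = io.StringIO()
episodes_to_csv(sample_titles, sample_ratings, buffer)
print(buffer.getvalue())
```

To write a real file instead of a buffer, open it with something like open("episodes.csv", "w", newline="", encoding="utf-8") (the file name is only an example) and pass the file object as out. The newline="" argument is what the csv module documentation recommends so that row endings are not doubled on Windows.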
