How do I scrape data that is on multiple pages without it failing?
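I'm really new to scraping data and I'm having trouble scraping multiple pages.
I'm trying to get the title of each episode as well as its rating.
I only manage to scrape the first page, and after that it stops working.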
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait

url = 'https://www.imdb.com/title/tt0386676/episodes?season=1'
next_season = "//*[@id='load_next_episodes']"

browser = webdriver.Chrome()
browser.get(url)

for season in range(1,10):
    i = 1
    episodes = browser.find_elements_by_class_name('info')
    for episode in episodes:
        title = episode.find_element_by_xpath(f'//*[@id="episodes_content"]/div[2]/div[2]/div[{i}]/div[2]/strong/a').text
        rating = episode.find_element_by_class_name('ipl-rating-star__rating').text
        print(title, rating)
        i += 1
    browser.find_element_by_xpath(next_season).click()

browser.close()
My output looks like this:
Pilot 7.4
Diversity Day 8.2
Health Care 7.7
The Alliance 7.9
Basketball 8.3
Hot Girl 7.6
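You can also get the page details without clicking the season button.
First read all the season numbers from the dropdown box, then iterate over them and load each season's URL directly.
You can create lists and append the data to them, then either iterate over the lists at the end or load them into a dataframe and export it to a CSV file.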
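The CSV export mentioned above could look roughly like the sketch below. It is only an illustration: it uses the standard-library csv module instead of a dataframe, the helper name episodes_to_csv and the column names are made up for this example, and the sample values are the first rows of the output shown in the question.

```python
import csv
import io


def episodes_to_csv(titles, ratings, out):
    """Write paired titles and ratings as CSV rows to a file-like object.

    Hypothetical helper for illustration; the column names are assumptions.
    """
    writer = csv.writer(out)
    writer.writerow(["title", "rating"])
    # zip pairs the two lists position by position
    for t, r in zip(titles, ratings):
        writer.writerow([t, r])


# Sample values taken from the output shown in the question
titles = ["Pilot", "Diversity Day", "Health Care"]
ratings = ["7.4", "8.2", "7.7"]

# An in-memory buffer stands in for a real file here
buffer = io.StringIO()
episodes_to_csv(titles, ratings, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To write a real file, pass a handle opened with open("episodes.csv", "w", newline="", encoding="utf-8") instead of the buffer. If pandas is installed, pandas.DataFrame({"title": titles, "rating": ratings}).to_csv("episodes.csv", index=False) should give an equivalent file.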
Code:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import Select

driver = webdriver.Chrome()
driver.get("https://www.imdb.com/title/tt0386676/episodes?season=1")
wait = WebDriverWait(driver, 10)

# Read every season number from the season dropdown
selectSeason = wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, '#bySeason')))
select = Select(selectSeason)
allSeasons = [option.get_attribute('value') for option in select.options]
print(allSeasons)

title = []
ratings = []
for season in allSeasons:
    # Load each season's page directly instead of clicking the button
    url = "https://www.imdb.com/title/tt0386676/episodes?season={}".format(season)
    print(url)
    driver.get(url)
    # Wait for the episode blocks, then collect title and rating from each
    for e in wait.until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, ".info"))):
        title.append(e.find_element(By.CSS_SELECTOR, "a[itemprop='name']").text)
        ratings.append(e.find_element(By.CSS_SELECTOR, ".ipl-rating-star.small .ipl-rating-star__rating").text)

# Print the collected pairs once all seasons are done
for t, r in zip(title, ratings):
    print(t + " --- " + r)

driver.quit()
Output: