How to scrape more than one page of critic reviews from Rotten Tomatoes?
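I have been using this scraper to collect critic reviews from this URL: https://www.rottentomatoes.com/m/avengers_endgame/reviews
However, I am struggling to work out how to go through the other pages, because at the moment it only scrapes the first page of critic reviews. Does anyone know how I would do this?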
from selenium import webdriver
from selenium.webdriver.common.by import By
import pandas as pd

driver = webdriver.Chrome()
driver.get("https://www.rottentomatoes.com/m/avengers_endgame/reviews")

review_1df = pd.DataFrame(columns=['Date', 'Reviewer', 'Website', 'Review', 'Score'])
dates = []
reviews = []
scores = []
newscores = []
names = []
sites = []

# find_elements(By...) replaces the find_elements_by_* helpers, which newer Selenium releases removed
results = driver.find_elements(By.CLASS_NAME, "review_area")
reviewnum = 1
reviewers = driver.find_elements(By.CLASS_NAME, "col-xs-8")

# Date, review text and raw score text for each review on the page
for r in results:
    dates.append(r.find_element(By.CLASS_NAME, 'subtle').text)
    reviews.append(r.find_element(By.CLASS_NAME, 'the_review').text)
    revs = r.find_element(By.CLASS_NAME, 'review_desc')
    scores.append(revs.find_element(By.CLASS_NAME, 'subtle').text)

# Reviewer name and website, looked up by position in the review list
for r in reviewers:
    names.append(r.find_element(By.XPATH, '//*[@id="reviews"]/div[2]/div[4]/div[' + str(reviewnum) + ']/div[1]/div[3]/a[1]').text)
    sites.append(r.find_element(By.XPATH, '//*[@id="reviews"]/div[2]/div[4]/div[' + str(reviewnum) + ']/div[1]/div[3]/a[2]/em').text)
    reviewnum += 1

# Entries that read only 'Full Review' have no score; otherwise drop the first 14 characters
for score in scores:
    if score == 'Full Review':
        newscores.append('no score')
    else:
        score2 = score[14:]
        newscores.append(score2)

review_1df['Date'] = dates
review_1df['Review'] = reviews
review_1df['Score'] = newscores
review_1df['Reviewer'] = names
review_1df['Website'] = sites
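You can use URL parameters to go to the next page of reviews and repeat the same steps. For example, the following URL takes you to the second page of reviews: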
https://www.rottentomatoes.com/m/avengers_endgame/reviews?type=&sort=&page=2
Note that the parameters are type=&sort=&page=2, where you can also specify the sort order and the review type. Change it to page=3 to go to the third page.
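As a rough illustration of the page parameter described above, the URL for any page can be built from the base address. This is only a sketch: `build_page_url`, `review_type` and `sort` are hypothetical names chosen here, and it assumes the site still accepts the `type`, `sort` and `page` query parameters in this form.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.rottentomatoes.com/m/avengers_endgame/reviews"


def build_page_url(page, review_type="", sort=""):
    """Return the reviews URL for one page (hypothetical helper, not part of any library)."""
    # Empty values are kept, so the query string has the same shape as the URL in the answer
    query = urlencode({"type": review_type, "sort": sort, "page": page})
    return f"{BASE_URL}?{query}"


print(build_page_url(2))
# https://www.rottentomatoes.com/m/avengers_endgame/reviews?type=&sort=&page=2
```

Passing each generated URL to `driver.get(...)` in turn would let the existing scraping steps run once per page.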
You will also have to add a check to see whether the page exists. For example, you will not get any reviews for this URL:
https://www.rottentomatoes.com/m/avengers_endgame/reviews?type=&sort=&page=200000
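One way to implement that check is to keep requesting pages until one comes back with no reviews. The sketch below is an assumption-laden outline rather than a tested scraper: `collect_reviews` and `make_selenium_fetcher` are hypothetical helpers, the `the_review` class name is taken from the question's code, and it assumes that an out-of-range page simply contains no review elements, as the URL above suggests.

```python
def collect_reviews(fetch_page, max_pages=1000):
    """Call fetch_page(1), fetch_page(2), ... and stop at the first page with no reviews."""
    all_reviews = []
    for page in range(1, max_pages + 1):  # max_pages is a safety cap against endless loops
        page_reviews = fetch_page(page)
        if not page_reviews:  # an empty page means we have gone past the last one
            break
        all_reviews.extend(page_reviews)
    return all_reviews


def make_selenium_fetcher(driver, base_url):
    """Build a fetch_page function backed by a Selenium driver (hypothetical helper)."""
    # Imported here so collect_reviews can be used and tested without Selenium installed
    from selenium.webdriver.common.by import By

    def fetch_page(page):
        driver.get(f"{base_url}?type=&sort=&page={page}")
        # 'the_review' is the class name used in the question's scraper
        return [el.text for el in driver.find_elements(By.CLASS_NAME, "the_review")]

    return fetch_page


# Intended use, assuming a working Chrome driver:
#   driver = webdriver.Chrome()
#   fetch = make_selenium_fetcher(driver, "https://www.rottentomatoes.com/m/avengers_endgame/reviews")
#   reviews = collect_reviews(fetch)
```

Keeping the stop condition separate from the browser code means the loop can be checked with a fake `fetch_page` before pointing it at the real site. The same pattern extends to dates, scores and reviewer names by returning one record per review instead of just the text.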