Get episode reviews from IMDB
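I am trying to scrape episode data from IMDB, along with the reviews for each episode. I want to get all the episodes and store them in a dataframe. But I have run into a problem: only 1 review is scraped per episode. When I tested it, there was one case where all the reviews were scraped, but that no longer works. Does anyone know how I can scrape all the reviews and store them in a dataframe?
Here is the code: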
library(dplyr)
library(rvest)
library(tidyverse)
getReviewLink = function(episodeLink) {
  episodePage = read_html(episodeLink)
  container = episodePage %>%
    html_nodes(".Hero__WatchContainer__NoVideo-sc-kvkd64-9.cTdSBT")
  reviewLinks = episodePage %>%
    html_nodes(".Hero__WatchContainer__NoVideo-sc-kvkd64-9.cTdSBT > ul > li:nth-child(1) > a") %>%
    html_attr("href") %>%
    paste("https://www.imdb.com", ., sep="")
  print(reviewLinks)
  cleanedReviewLink = ifelse(reviewLinks == "https://www.imdb.com", NA, reviewLinks)
  print(cleanedReviewLink)
  get_reviews = ifelse(is.na(cleanedReviewLink), NA, read_html(cleanedReviewLink) %>% html_nodes(".show-more__control") %>%
    html_text() %>% str_trim())
  print(get_reviews)
  return(get_reviews)
}
episodes = data.frame()
for (page_result in seq(from = 1, to = 51, by = 50)){
  link = paste0("https://www.imdb.com/search/title/?title_type=tv_episode&num_votes=1000,&sort=user_rating,desc&start=",
                page_result, "&ref_=adv_nxt")
  page = read_html(link)
  show_name = page %>% html_nodes(".lister-item-index+ a") %>% html_text() %>% str_trim()
  episode_name = page %>% html_nodes("small+ a") %>% html_text()
  episode_links = page %>% html_nodes("small+ a") %>% html_attr("href") %>%
    paste("https://www.imdb.com", ., sep="")
  episodeReview = sapply(episode_links, FUN = getReviewLink, USE.NAMES = FALSE)
  print(episodeReview)
  episodes = rbind(episodes, data.frame(show_name, episode_name, episodeReview, stringsAsFactors = FALSE))
  print(paste("Page:", page_result))
}
Any help is appreciated.
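I ran your code, and there is a small bug in your getReviewLink function.
The following part discards every review except the first and returns only that one: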
get_reviews = ifelse(is.na(cleanedReviewLink), NA, read_html(cleanedReviewLink) %>% html_nodes(".show-more__control") %>%
html_text() %>% str_trim())
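The reason is that ifelse() returns a result with the same length as its test, and is.na(cleanedReviewLink) has length 1 here. A minimal sketch of that behaviour, using made-up strings in place of the scraped text (the reviews vector and review_link value are hypothetical stand-ins, not real IMDB data):

```r
# Stand-in for the character vector that html_text() would return.
reviews <- c("first review", "second review", "third review")
# Hypothetical link, only used so that is.na() has something to test.
review_link <- "https://www.imdb.com/title/tt0000000/reviews"

# The test has length 1, so ifelse() keeps only the first element of `reviews`.
from_ifelse <- ifelse(is.na(review_link), NA, reviews)
print(from_ifelse)      # [1] "first review"

# A plain if/else is not vectorised over the test, so the whole vector survives.
from_if <- if (is.na(review_link)) NA_character_ else reviews
print(length(from_if))  # [1] 3
```

A plain if/else like the second form also keeps the NA guard for episodes that have no review link.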
Replace that part of getReviewLink with:
get_reviews = read_html(cleanedReviewLink) %>% html_nodes(".show-more__control") %>%
html_text() %>% str_trim() %>% str_subset(".+")
[1] "I haven't seen every episode in the world, but this is as close to perfect as I have ever seen. Never thought I would say something could match the likes of Lord of the Rings. It's the most cinematic episode I've ever seen with maybe \"Battle of the Bastards\" being alongside it for obvious reasons, but for an animated episode to do this is even more shocking. It would be hard for someone to imagine an animated episode being as cinematic as an HBO production and this episode made it possible. This episode was also very emotional as two of my favorite characters' (Armin and Erwin) lives were on the line. I even cried when Armin was gonna make the sacrifice. It was also sad to see Erwin's unfortunate plan come to fruition, but he did it knowing it was for a greater good. I also loved the parallels of all the main characters sacrificing themselves in this episode."
[2] "I cant understand how anyone can rate something as incredible like this below 10. This episode is amazingly godly and will go down in history as one of the greats"
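Once the function returns several reviews per episode, the per-episode results no longer have one value each, so they cannot go straight into data.frame() next to show_name and episode_name. One possible way to store them is a long dataframe with one row per review. The sketch below assumes the reviews were collected as a list with one character vector per episode (for example with lapply() instead of sapply(), which always returns a list); all values shown are made up for illustration:

```r
# Hypothetical stand-ins for the vectors built inside the scraping loop.
show_name    <- c("Show A", "Show B", "Show C")
episode_name <- c("Pilot", "Finale", "Special")

# One character vector of reviews per episode; NA marks an episode with no reviews.
episode_reviews <- list(
  c("Great.", "Loved it."),
  "Not bad.",
  NA_character_
)

# Repeat each episode's names once per review so every column has the same length.
n_reviews <- lengths(episode_reviews)
episodes_long <- data.frame(
  show_name    = rep(show_name, times = n_reviews),
  episode_name = rep(episode_name, times = n_reviews),
  review       = unlist(episode_reviews),
  stringsAsFactors = FALSE
)
print(episodes_long)
```

Each page of results can then be appended with rbind() as in the original loop.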
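Also, you are not actually scraping all the reviews. For example, this episode has 951 reviews: https://www.imdb.com/title/tt9906260/reviews?ref_=tt_ov_rt
but your code only gets the first 25. If you need all of them, you have to keep clicking "Load More". This can be done with RSelenium or with imdbapi.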
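Whichever tool does the clicking, the surrounding logic is a loop that keeps requesting the next batch until nothing is left. The sketch below shows only that loop. fetch_page is a hypothetical function you would write yourself (with RSelenium or an HTTP request); it is assumed to return a list with a reviews vector and a next_key that is NULL on the last batch. The fake fetcher only imitates that shape and does not touch IMDB:

```r
# Generic "keep loading more" loop.
# fetch_page: hypothetical function taking a key and returning
#   list(reviews = <character vector>, next_key = <key, or NULL when done>).
collect_all_reviews <- function(fetch_page, first_key, max_pages = 100) {
  all_reviews <- character(0)
  key <- first_key
  pages <- 0
  # Stop when there is no next batch, or after max_pages as a safety limit.
  while (!is.null(key) && pages < max_pages) {
    page <- fetch_page(key)
    all_reviews <- c(all_reviews, page$reviews)
    key <- page$next_key
    pages <- pages + 1
  }
  all_reviews
}

# Fake fetcher standing in for the real request: three batches of made-up reviews.
fake_pages <- list(
  p1 = list(reviews = c("r1", "r2"), next_key = "p2"),
  p2 = list(reviews = c("r3", "r4"), next_key = "p3"),
  p3 = list(reviews = "r5", next_key = NULL)
)
fake_fetch <- function(key) fake_pages[[key]]

all_reviews <- collect_all_reviews(fake_fetch, "p1")
print(length(all_reviews))  # [1] 5
```

The max_pages limit is there so that a fetcher that never signals the end cannot loop forever; adding a pause between requests is also worth considering.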