How can we scrape missing values from IMDB in R?

library(rvest)

imdb_page <- read_html("https://www.imdb.com/search/title/?title_type=feature&release_date=2018-01-01,2019-12-31&countries=us&sort=alpha,asc&ref_=adv_prv")
title <- imdb_page %>% html_nodes(".lister-item-header a") %>% html_text()
rating <- imdb_page %>% html_nodes(".ratings-imdb-rating strong") %>% html_text()
movies <- data.frame(title)
movies2 <- data.frame(rating)

Basically, the code above scrapes the titles and ratings of 50 movies. I want the missing values to be included as NA too.

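However, that does not happen, because IMDB does not put missing ratings in the HTML tag at all; the tag only exists for actual values (I used SelectorGadget to get the tags). As a result, title has 50 observations while rating has only 33, which is not what I want. I tried using html_node() together with html_nodes(), but R gave an error message saying that css and xpath cannot be used together. I also tried trim=TRUE and replace(!nzchar(.), NA), but neither worked.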

Is there a way to work around this and make sure I get 50 ratings (including NA or empty values)?

We can use the .ratings-user-rating class to get the whole list of ratings, including the entries that have no rating:

library(rvest)
url = "https://www.imdb.com/search/title/?title_type=feature&release_date=2018-01-01,2019-12-31&countries=us&sort=alpha,asc&ref_=adv_prv" 

# store the raw text so it can be cleaned in the next step
df = url %>% read_html() %>% html_nodes('.ratings-user-rating') %>% html_text2()
df

 [1] "Rate this\n 1 2 3 4 5 6 7 8 9 10 -/10 X "   "Rate this\n 1 2 3 4 5 6 7 8 9 10 -/10 X "   "Rate this\n 1 2 3 4 5 6 7 8 9 10 3.6/10 X "
 [4] "Rate this\n 1 2 3 4 5 6 7 8 9 10 -/10 X "   "Rate this\n 1 2 3 4 5 6 7 8 9 10 4.9/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 4.1/10 X "
 [7] "Rate this\n 1 2 3 4 5 6 7 8 9 10 7.4/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 4.6/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 -/10 X "  
[10] "Rate this\n 1 2 3 4 5 6 7 8 9 10 7.9/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 3.3/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 -/10 X "  
[13] "Rate this\n 1 2 3 4 5 6 7 8 9 10 6.5/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 6.6/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 5/10 X "  
[16] "Rate this\n 1 2 3 4 5 6 7 8 9 10 3.6/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 -/10 X "   "Rate this\n 1 2 3 4 5 6 7 8 9 10 4.7/10 X "
[19] "Rate this\n 1 2 3 4 5 6 7 8 9 10 3.1/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 5.4/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 -/10 X "  
[22] "Rate this\n 1 2 3 4 5 6 7 8 9 10 5.7/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 5.1/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 -/10 X "  
[25] "Rate this\n 1 2 3 4 5 6 7 8 9 10 -/10 X "   "Rate this\n 1 2 3 4 5 6 7 8 9 10 6.9/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 4.3/10 X "
[28] "Rate this\n 1 2 3 4 5 6 7 8 9 10 6.6/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 -/10 X "   "Rate this\n 1 2 3 4 5 6 7 8 9 10 -/10 X "  
[31] "Rate this\n 1 2 3 4 5 6 7 8 9 10 4.1/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 4.6/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 -/10 X "  
[34] "Rate this\n 1 2 3 4 5 6 7 8 9 10 5.1/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 8.3/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 7.1/10 X "
[37] "Rate this\n 1 2 3 4 5 6 7 8 9 10 5.8/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 3.4/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 3.3/10 X "
[40] "Rate this\n 1 2 3 4 5 6 7 8 9 10 3.2/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 -/10 X "   "Rate this\n 1 2 3 4 5 6 7 8 9 10 -/10 X "  
[43] "Rate this\n 1 2 3 4 5 6 7 8 9 10 -/10 X "   "Rate this\n 1 2 3 4 5 6 7 8 9 10 4.6/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 -/10 X "  
[46] "Rate this\n 1 2 3 4 5 6 7 8 9 10 6.8/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 -/10 X "   "Rate this\n 1 2 3 4 5 6 7 8 9 10 6.6/10 X "
[49] "Rate this\n 1 2 3 4 5 6 7 8 9 10 4.4/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 1.5/10 X "

We then need to clean the data to extract the ratings.

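library(stringr)

# drop the "Rate this 1 ... 10" prefix and the "/10 X " suffix, then turn "-" into NA
df = df %>% gsub(".*9 10", "", .) %>% str_sub(start=1, end=-7) %>% str_replace_all('-', replacement = NA_character_)
df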

 [1] NA     NA     " 3.6" NA     " 4.9" " 4.1" " 7.4" " 4.6" NA     " 7.9" " 3.3" NA     " 6.5" " 6.6" " 5"   " 3.6" NA     " 4.7" " 3.1" " 5.4" NA     " 5.7"
[23] " 5.1" NA     NA     " 6.9" " 4.3" " 6.6" NA     NA     " 4.1" " 4.6" NA     " 5.1" " 8.3" " 7.1" " 5.8" " 3.4" " 3.3" " 3.2" NA     NA     NA     " 4.6"
[45] NA     " 6.8" NA     " 6.6" " 4.4" " 1.5"
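The cleaned values above are still character strings with a leading space. As a hedged alternative, the following base R sketch pulls out the token in front of "/10" and converts it to a number. The three sample strings are copied from the output above; the names raw, token, and ratings are illustrative only and are not part of the original answer.

```r
# Sample strings copied from the scraped output above
raw <- c("Rate this\n 1 2 3 4 5 6 7 8 9 10 -/10 X ",
         "Rate this\n 1 2 3 4 5 6 7 8 9 10 3.6/10 X ",
         "Rate this\n 1 2 3 4 5 6 7 8 9 10 5/10 X ")

# Keep only the non-space token directly before "/10".
# (?s) lets "." match the newline inside each string.
token <- sub("(?s).*\\s(\\S+)/10.*", "\\1", raw, perl = TRUE)

# "-" marks a movie without a rating; turn it into NA before converting
token[token == "-"] <- NA
ratings <- as.numeric(token)

ratings
```

Converting to numeric also removes the leading spaces, so the column can be used directly for sorting or summary statistics.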

To get the movie names:

movie = url %>% read_html() %>%  html_nodes(".lister-item-header a") %>% html_text()

data.frame(Movie = movie, Ratings = df)
                                                    Movie Ratings
1                                              #1915House    <NA>
2                                              #Bodygoals    <NA>
3                                               #Followme     3.6
4                                             #FullMethod    <NA>
5                                                   #Like     4.9
6                                             #SquadGoals     4.1

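You need to do this parsing in two steps. First, collect the parent nodes of all 50 movies with html_nodes(). Then parse that collection of nodes with html_node() (without the s), which returns one result for each of the 50 parent nodes, with NA for any node that lacks the requested element.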

library(rvest)
library(dplyr)

imdb_page <- read_html("https://www.imdb.com/search/title/?title_type=feature&release_date=2018-01-01,2019-12-31&countries=us&sort=alpha,asc&ref_=adv_prv")

#get the parent node of each movie
movies <- imdb_page %>% html_elements( "div.lister-item")

#now parse each movie node for the desired subnode
title <- movies %>% html_element(".lister-item-header a") %>% html_text()
rating <- movies %>% html_element(".ratings-imdb-rating strong") %>% html_text()

Note: the current style in rvest 1.0 has changed from html_node(s) to html_element(s).
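The two-step approach can be tried offline with a small example. The sketch below assumes rvest 1.0 or later and uses made-up markup that only mimics the class names used in the selectors above; the movie names and the simplified structure are hypothetical, not IMDB's real HTML.

```r
library(rvest)

# Hypothetical markup: three movies, the second one has no rating element
page <- read_html('
<div class="lister-item">
  <h3 class="lister-item-header"><a>Movie A</a></h3>
  <div class="ratings-imdb-rating"><strong>3.6</strong></div>
</div>
<div class="lister-item">
  <h3 class="lister-item-header"><a>Movie B</a></h3>
</div>
<div class="lister-item">
  <h3 class="lister-item-header"><a>Movie C</a></h3>
  <div class="ratings-imdb-rating"><strong>4.9</strong></div>
</div>')

# Selecting the ratings directly skips the movie without one
direct <- page %>% html_elements(".ratings-imdb-rating strong") %>% html_text()

# Step 1: one parent node per movie
movies <- page %>% html_elements("div.lister-item")

# Step 2: html_element() returns exactly one result per parent, NA if absent
title  <- movies %>% html_element(".lister-item-header a") %>% html_text()
rating <- movies %>% html_element(".ratings-imdb-rating strong") %>% html_text() %>% as.numeric()

result <- data.frame(Title = title, Rating = rating)
result
```

Here direct has only two entries, while result keeps all three rows with NA in the Rating column for the movie that has no rating.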