How can we scrape missing values from IMDB in R?
library(rvest)
imdb_page <- read_html("https://www.imdb.com/search/title/?title_type=feature&release_date=2018-01-01,2019-12-31&countries=us&sort=alpha,asc&ref_=adv_prv")
title <- imdb_page %>% html_nodes(".lister-item-header a") %>% html_text()
rating <- imdb_page %>% html_nodes(".ratings-imdb-rating strong") %>% html_text()
movies <- data.frame(title)
movies2 <- data.frame(rating)
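Basically, the code above scrapes the titles and ratings of 50 movies. I would like movies without a rating to show up as NA.
However, that does not happen: IMDB leaves the rating tag out of the HTML entirely when a movie has no rating, so the selector only matches actual values (I used SelectorGadget to find the selectors). As a result I get 50 observations for title but only 33 for rating, which is not what I want. I tried combining html_node() with html_nodes(), but R gave an error saying that css and xpath cannot be used together. I also tried trim=TRUE and replace(!nzchar(.), NA), but neither worked.
Is there a way to work around this and make sure I get 50 ratings (including NA or empty values)?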
We can use the ratings-user-rating class to get the full list of ratings, with one entry per movie:
library(rvest)
url = "https://www.imdb.com/search/title/?title_type=feature&release_date=2018-01-01,2019-12-31&countries=us&sort=alpha,asc&ref_=adv_prv"
# store the raw text in df so it can be cleaned in the next step
df <- url %>% read_html() %>% html_nodes('.ratings-user-rating') %>% html_text2()
df
[1] "Rate this\n 1 2 3 4 5 6 7 8 9 10 -/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 -/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 3.6/10 X "
[4] "Rate this\n 1 2 3 4 5 6 7 8 9 10 -/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 4.9/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 4.1/10 X "
[7] "Rate this\n 1 2 3 4 5 6 7 8 9 10 7.4/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 4.6/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 -/10 X "
[10] "Rate this\n 1 2 3 4 5 6 7 8 9 10 7.9/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 3.3/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 -/10 X "
[13] "Rate this\n 1 2 3 4 5 6 7 8 9 10 6.5/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 6.6/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 5/10 X "
[16] "Rate this\n 1 2 3 4 5 6 7 8 9 10 3.6/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 -/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 4.7/10 X "
[19] "Rate this\n 1 2 3 4 5 6 7 8 9 10 3.1/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 5.4/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 -/10 X "
[22] "Rate this\n 1 2 3 4 5 6 7 8 9 10 5.7/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 5.1/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 -/10 X "
[25] "Rate this\n 1 2 3 4 5 6 7 8 9 10 -/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 6.9/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 4.3/10 X "
[28] "Rate this\n 1 2 3 4 5 6 7 8 9 10 6.6/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 -/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 -/10 X "
[31] "Rate this\n 1 2 3 4 5 6 7 8 9 10 4.1/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 4.6/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 -/10 X "
[34] "Rate this\n 1 2 3 4 5 6 7 8 9 10 5.1/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 8.3/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 7.1/10 X "
[37] "Rate this\n 1 2 3 4 5 6 7 8 9 10 5.8/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 3.4/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 3.3/10 X "
[40] "Rate this\n 1 2 3 4 5 6 7 8 9 10 3.2/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 -/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 -/10 X "
[43] "Rate this\n 1 2 3 4 5 6 7 8 9 10 -/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 4.6/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 -/10 X "
[46] "Rate this\n 1 2 3 4 5 6 7 8 9 10 6.8/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 -/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 6.6/10 X "
[49] "Rate this\n 1 2 3 4 5 6 7 8 9 10 4.4/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 1.5/10 X "
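We still need to clean the data to extract the ratings.
library(stringr)
# drop everything up to "9 10", cut the trailing "/10 X ", and turn "-" into NA
df <- df %>% gsub(".*9 10", "", .) %>% str_sub(start=1, end=-7) %>% str_replace_all('-', replacement = NA_character_)
df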
[1] NA NA " 3.6" NA " 4.9" " 4.1" " 7.4" " 4.6" NA " 7.9" " 3.3" NA " 6.5" " 6.6" " 5" " 3.6" NA " 4.7" " 3.1" " 5.4" NA " 5.7"
[23] " 5.1" NA NA " 6.9" " 4.3" " 6.6" NA NA " 4.1" " 4.6" NA " 5.1" " 8.3" " 7.1" " 5.8" " 3.4" " 3.3" " 3.2" NA NA NA " 4.6"
[45] NA " 6.8" NA " 6.6" " 4.4" " 1.5"
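The cleaning step can also be tried offline. The following is a minimal base R sketch, not the answer's original method: it pulls out the token directly before "/10" with a lookahead and converts it to a number. The vector `raw` holds three strings copied from the output above; the names `raw`, `token`, and `rating` are only illustrative.

```r
# Three sample strings copied from the scraped output above
raw <- c("Rate this\n 1 2 3 4 5 6 7 8 9 10 -/10 X ",
         "Rate this\n 1 2 3 4 5 6 7 8 9 10 3.6/10 X ",
         "Rate this\n 1 2 3 4 5 6 7 8 9 10 5/10 X ")

# Match the run of digits, dots, or dashes that is immediately followed by "/10"
token <- regmatches(raw, regexpr("[0-9.-]+(?=/10)", raw, perl = TRUE))

# A lone "-" means the movie has no rating, so mark it as missing
token[token == "-"] <- NA

# Convert to numeric so the ratings can be used in calculations
rating <- as.numeric(token)
rating
```

Unlike the str_sub() approach, this does not depend on the exact number of trailing characters, and the result is numeric rather than character.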
Then get the movie titles and combine them with the ratings:
movie = url %>% read_html() %>% html_nodes(".lister-item-header a") %>% html_text()
data.frame(Movie = movie, Ratings = df)
Movie Ratings
1 #1915House <NA>
2 #Bodygoals <NA>
3 #Followme 3.6
4 #FullMethod <NA>
5 #Like 4.9
6 #SquadGoals 4.1
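You need to do this parsing in two steps. First collect the parent nodes of all 50 movies with html_nodes(). Then parse that collection of nodes with html_node() (without the s) to get a result for each of the 50 nodes, including NA for those where the requested child element is missing.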
library(rvest)
library(dplyr)
imdb_page <- read_html("https://www.imdb.com/search/title/?title_type=feature&release_date=2018-01-01,2019-12-31&countries=us&sort=alpha,asc&ref_=adv_prv")
# get the parent node of each movie
movies <- imdb_page %>% html_elements("div.lister-item")
# now parse each movie node for the desired child node
title <- movies %>% html_element(".lister-item-header a") %>% html_text()
rating <- movies %>% html_element(".ratings-imdb-rating strong") %>% html_text()
Note: as of rvest 1.0, the current style uses html_element(s) in place of html_node(s).
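To see the two-step approach without depending on the live page, here is a small sketch on made-up markup. The HTML below is hypothetical: it only imitates the class names used above, and the second movie deliberately has no rating block. It assumes rvest 1.0 or later for minimal_html(), html_elements(), and html_element().

```r
library(rvest)

# Hypothetical markup imitating the class names from the IMDB page;
# "Movie B" has no rating element at all
page <- minimal_html('
  <div class="lister-item">
    <h3 class="lister-item-header"><a>Movie A</a></h3>
    <div class="ratings-imdb-rating"><strong>7.4</strong></div>
  </div>
  <div class="lister-item">
    <h3 class="lister-item-header"><a>Movie B</a></h3>
  </div>
  <div class="lister-item">
    <h3 class="lister-item-header"><a>Movie C</a></h3>
    <div class="ratings-imdb-rating"><strong>5.1</strong></div>
  </div>')

# One step: only ratings that exist are returned, so the lengths do not line up
flat <- page %>% html_elements(".ratings-imdb-rating strong") %>% html_text()

# Two steps: one parent node per movie, then exactly one result per parent
items  <- page %>% html_elements("div.lister-item")
title  <- items %>% html_element(".lister-item-header a") %>% html_text()
rating <- items %>% html_element(".ratings-imdb-rating strong") %>% html_text()

# Titles and ratings now have the same length and can share a data frame
movies <- data.frame(title = title, rating = as.numeric(rating))
movies
```

Here `flat` has two entries while `movies` has three rows, with NA in the rating column for the movie that has no rating.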