rvest::html_nodes returns a partial list (only a few items)
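Using the rvest package, I am trying to scrape the names of the actors/actresses from the IMDb page for the movie JFK (https://www.imdb.com/title/tt0102138/fullcredits?ref_=tt_ql_1).
SelectorGadget says the element I want for each person is "td:nth-child(2)".
Here is the code I am using.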
library(rvest)
library(stringr)
startFilm <- "tt0102138" #JFK
personsNames <- c()
pagePath <- paste("https://www.imdb.com/title/", startFilm, "/?ref_=nv_sr_1?ref_=nv_sr_1", sep = "")
moviePage <- read_html(pagePath)
personNodes <- html_nodes(moviePage, "td:nth-child(2)")
personText <- html_text(personNodes)
for (i in 1:length(personText)){
actor <- (unlist(str_split(personText[i], "\n")))[2]
personsNames[i] <- substring(actor, 2, nchar(actor))
}
personsNames
According to the page at https://www.imdb.com/title/tt0102138/fullcredits?ref_=tt_ql_1, this list should be fairly long.
However, when I run the code, I get only 15 names.
[1] "Sally Kirkland" "Anthony Ramirez" "Ray LePere" "Steve Reed" "Jodie Farber" "Columbia Dubose"
[7] "Randy Means" "Kevin Costner" "Jay O. Sanders" "E.J. Morris" "Cheryl Penland" "Jim Gough"
[13] "Perry R. Russo" "Mike Longman" "Edward Asner"
Why is the list truncated?
How should I adjust my code to get the full list of actors/actresses in the movie?
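After getting the names from html_nodes, you need to do some data cleaning. Note that the code below reads the fullcredits URL, whereas the code in the question builds the URL of the main title page; that difference likely explains why only 15 names came back.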
url <- "https://www.imdb.com/title/tt0102138/fullcredits?ref_=tt_ql_1"
library(rvest)
url %>%
read_html() %>%
html_nodes("td:nth-child(2)") %>%
html_text() %>%
grep("...", ., invert = TRUE, value = TRUE, fixed = TRUE) %>%
trimws %>%
.[. != ""]
# [1] "Sally Kirkland" "Anthony Ramirez" "Ray LePere"
# [4] "Steve Reed" "Jodie Farber" "Columbia Dubose"
# [7] "Randy Means" "Kevin Costner" "Jay O. Sanders"
# [10] "E.J. Morris" "Cheryl Penland" "Jim Gough"
# [13] "Perry R. Russo" "Mike Longman" "Edward Asner"
# [16] "Jack Lemmon" "Vincent D'Onofrio" "Gary Oldman"
#....
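To see what each cleaning step does without requesting the live page, here is a minimal offline sketch of the same pipeline. The inline HTML is a made-up stand-in for the credits tables (its structure is an assumption, not copied from IMDb), and `sample_html` and `cleaned` are hypothetical names.

```r
library(rvest)

# Made-up stand-in for the page: one cast table and one crew-style table.
# In the crew-style table the second cell is "..." or blank, which is
# the kind of noise the cleaning steps are meant to drop.
sample_html <- '
<table class="cast_list">
  <tr><td class="primary_photo"><img alt="Kevin Costner"></td>
      <td>
 Kevin Costner
      </td><td>...</td><td class="character">Jim Garrison</td></tr>
  <tr><td class="primary_photo"><img alt="Jay O. Sanders"></td>
      <td> Jay O. Sanders </td><td>...</td><td class="character">Lou Ivon</td></tr>
</table>
<table>
  <tr><td class="name">Some Crew Member</td><td>...</td><td class="credit"></td></tr>
  <tr><td></td><td> </td><td></td></tr>
</table>'

cleaned <- sample_html %>%
  read_html() %>%                      # parse the literal HTML string
  html_nodes("td:nth-child(2)") %>%    # second cell of every row
  html_text() %>%
  grep("...", ., invert = TRUE, value = TRUE, fixed = TRUE) %>%  # drop "..." cells
  trimws() %>%                         # strip surrounding whitespace and newlines
  .[. != ""]                           # drop cells that were blank

cleaned
```

On this sample only the two cast names survive; the "..." cell and the blank cell are filtered out.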
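Here is what I did. If you only need the actors/actresses, you can run the code below. I targeted the specific location of the names, so you get them exactly, with no string manipulation needed.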
library(rvest)
library(stringi)
read_html("https://www.imdb.com/title/tt0102138/fullcredits?ref_=tt_ql_1") %>%
html_nodes("td.primary_photo") %>%
html_nodes("img") %>%
html_attr("alt")
# [1] "Sally Kirkland" "Anthony Ramirez" "Ray LePere" "Steve Reed"
# [5] "Jodie Farber" "Columbia Dubose" "Randy Means" "Kevin Costner"
#[249] "Mark Edward Walters" "Earl Warren" "John B. Wells" "Jim White"
#[253] "Phillip L. Willis" "Rosemary Willis" "Louis Steven Witt" "Angus G. Wynne III"
As a bonus, if you want to create a data frame containing both the actor names and the character names, you can try the following.
library(tibble)  # provides tibble()
mydf <- tibble(actors = read_html("https://www.imdb.com/title/tt0102138/fullcredits?ref_=tt_ql_1") %>%
                 html_nodes("td.primary_photo") %>%
                 html_nodes("img") %>%
                 html_attr("alt"),
               characters = read_html("https://www.imdb.com/title/tt0102138/fullcredits?ref_=tt_ql_1") %>%
                 html_nodes(".character") %>%
                 html_text() %>%
                 # remove newlines and runs of two or more whitespace characters;
                 # the backslash in \\s must be doubled inside an R string
                 stri_replace_all_regex(pattern = "\n|\\s{2,}", replacement = ""))
# actors characters
# <chr> <chr>
# 1 Sally Kirkland Rose Cheramie
# 2 Anthony Ramirez Epileptic
# 3 Ray LePere Zapruder
# 4 Steve Reed John F. Kennedy - Double
# 5 Jodie Farber Jackie Kennedy - Double(as Jodi Farber)
# 6 Columbia Dubose Nellie Connally - Double
# 7 Randy Means Gov. Connally - Double
# 8 Kevin Costner Jim Garrison
# 9 Jay O. Sanders Lou Ivon
#10 E.J. Morris Plaza Witness #1
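Building the two columns from separate selectors assumes that both return the same number of matches in the same order. As a hedged alternative, the sketch below walks each table row and pulls both fields from that row using `html_node()` (singular), so a row with a missing cell yields NA instead of shifting the column. The inline HTML is a made-up stand-in for the cast table, and `sample_html`, `rows`, and `cast_df` are hypothetical names.

```r
library(rvest)

# Made-up stand-in for the cast table; the third row deliberately
# has no character cell to show how a missing value is handled.
sample_html <- '<table class="cast_list">
  <tr><td class="primary_photo"><img alt="Kevin Costner"></td>
      <td class="character">
          Jim Garrison
      </td></tr>
  <tr><td class="primary_photo"><img alt="Jay O. Sanders"></td>
      <td class="character">Lou Ivon</td></tr>
  <tr><td class="primary_photo"><img alt="Placeholder Name"></td></tr>
</table>'

# One node per table row
rows <- sample_html %>%
  read_html() %>%
  html_nodes("tr")

# html_node() returns exactly one result per row (a missing node if
# nothing matches), so both columns always have one entry per row.
cast_df <- data.frame(
  actors = rows %>% html_node("td.primary_photo img") %>% html_attr("alt"),
  characters = rows %>% html_node(".character") %>% html_text() %>% trimws(),
  stringsAsFactors = FALSE
)

cast_df
```

The trade-off is that rows without a photo cell would also appear (with NA in both columns) and may need to be filtered out on the real page.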