Inputting NA where there are missing values when scraping with rvest
I want to use rvest to scrape a page that lists the titles and run times of talks at a recent conference, and then combine those values into a tibble:
library(tibble)
library(rvest)

url <- "https://channel9.msdn.com/Events/useR-international-R-User-conferences/useR-International-R-User-2017-Conference?sort=status&direction=desc&page=14"
page <- read_html(url)  # parse the page so that `page` exists before it is queried

# talk titles
title <- page %>%
  html_nodes("h3 a") %>%
  html_text()

# talk run times
length <- page %>%
  html_nodes(".tile .caption") %>%
  html_text()

df <- tibble(title, length)
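If you look at the page, you'll see that one of the talks has no run time, and in the page source that talk has no class="caption" value.
Is there a way to fill in NA to show the missing value?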
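The easiest way is to select a node that encloses both of the nodes you want for each row, then iterate over those enclosing nodes, pulling out the two pieces one row at a time. purrr::map_df is handy not only for the iteration, but also for combining the results into a nice tibble: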
library(rvest)
library(purrr)

url <- "https://channel9.msdn.com/Events/useR-international-R-User-conferences/useR-International-R-User-2017-Conference?sort=status&direction=desc&page=14"
page <- read_html(url)

df <- page %>%
  html_nodes('article') %>%    # select enclosing nodes
  # iterate over each, pulling out desired parts and coerce to data.frame
  map_df(~list(title = html_nodes(.x, 'h3 a') %>%
                 html_text() %>%
                 {if(length(.) == 0) NA else .},    # replace length-0 elements with NA
               length = html_nodes(.x, '.tile .caption') %>%
                 html_text() %>%
                 {if(length(.) == 0) NA else .}))

df
#> # A tibble: 12 x 2
#> title length
#> <chr> <chr>
#> 1 Introduction to Natural Language Processing with R II 01:15:00
#> 2 Introduction to Natural Language Processing with R 01:22:13
#> 3 Solving iteration problems with purrr II 01:22:49
#> 4 Solving iteration problems with purrr 01:32:23
#> 5 Markov-Switching GARCH Models in R: The MSGARCH Package 15:55
#> 6 Interactive bullwhip effect exploration using SCperf and Shiny 16:02
#> 7 Actuarial and statistical aspects of reinsurance in R 14:15
#> 8 Transformation Forests 16:19
#> 9 Room 2.02 Lightning Talks 50:35
#> 10 R and Haskell: Combining the best of two worlds 14:45
#> 11 *GNU R* on a Programmable Logic Controller (PLC) in an Embedded-Linux Environment <NA>
#> 12 Performance Benchmarking of the R Programming Environment on Knight's Landing 19:32
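The same idea can be tried offline. The sketch below is only an illustration: the HTML string is made up to mirror the selectors used above (it is not the real page), and it swaps in rvest's html_node() (spelled html_element() in rvest 1.0 and later) for the explicit length check. When html_node() is applied to a set of enclosing nodes it returns one result per node, and html_text() yields NA for a node that was not found.

```r
library(rvest)
library(tibble)

# Hypothetical markup standing in for the real page:
# the second talk has no .caption node at all.
page <- read_html('
<html><body>
  <article><h3><a href="#">Talk A</a></h3><div class="tile"><div class="caption">15:55</div></div></article>
  <article><h3><a href="#">Talk B</a></h3><div class="tile"></div></article>
  <article><h3><a href="#">Talk C</a></h3><div class="tile"><div class="caption">19:32</div></div></article>
</body></html>')

# One enclosing node per talk
talks <- html_nodes(page, "article")

# html_node() keeps one entry per enclosing node, so both columns stay aligned
df <- tibble(
  title  = talks %>% html_node("h3 a") %>% html_text(),
  length = talks %>% html_node(".tile .caption") %>% html_text()
)

df
```

Because both columns come from the same set of article nodes, they always have the same number of rows, and the talk without a caption shows up as NA in length.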
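I ran into the same problem, but I managed to get it working without going through a node that encloses the two target nodes.
Using your code, it would be: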
library(tibble)
library(rvest)

url <- "https://channel9.msdn.com/Events/useR-international-R-User-conferences/useR-International-R-User-2017-Conference?sort=status&direction=desc&page=14"
page <- read_html(url)  # parse the page so that `page` exists before it is queried

title <- page %>%
  html_nodes("h3 a") %>%
  html_text() %>%
  {if(length(.) == 0) NA else .}    # NA if the selector matched nothing

length <- page %>%
  html_nodes(".tile .caption") %>%
  html_text() %>%
  {if(length(.) == 0) NA else .}    # NA if the selector matched nothing

df <- tibble(title, length)
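The repeated {if(length(.) == 0) NA else .} step can be pulled out into a small function. This is just a sketch: text_or_na is a hypothetical name, not something rvest provides. Note that it only turns a completely empty result into NA; it does not pad a vector that is missing a single entry, so it fits best when applied to one enclosing node at a time.

```r
# Hypothetical helper (not part of rvest): return NA when a selector
# matched nothing, otherwise return the extracted text unchanged.
text_or_na <- function(x) {
  if (length(x) == 0) NA_character_ else x
}

empty_result <- text_or_na(character(0))         # nothing matched
full_result  <- text_or_na(c("15:55", "16:02"))  # passed through as-is

empty_result
full_result
```

Inside a pipe it would be used as html_text() %>% text_or_na().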