Rvest scraping Google News with different number of rows

Question

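I am scraping Google News with rvest.
However, for some search keywords I get missing values in the "Time" element. Because of these missing values, building the data frame of scraped results fails with a "differing number of rows" error.
Is there a way to fill these missing values with NA?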

Below is a sample of the code I am using.

library(rvest)
library(dplyr)  # for mutate()

# `Search` holds the query keyword and is defined elsewhere
html_dat <- read_html(paste0("https://news.google.com/search?q=", Search, "&hl=en-US&gl=US&ceid=US%3Aen"))

dat <- data.frame(Link = html_dat %>%
                    html_nodes('.VDXfz') %>%
                    html_attr('href')) %>%
  mutate(Link = gsub("./articles/", "https://news.google.com/articles/", Link))

# Each column is scraped independently, so the vectors can differ in length
news_dat <- data.frame(
  Title = html_dat %>%
    html_nodes('.DY5T1d') %>%
    html_text(),
  Link = dat$Link,
  Description = html_dat %>%
    html_nodes('.Rai5ob') %>%
    html_text(),
  Time = html_dat %>%
    html_nodes('.WW6dff') %>%
    html_text()
)

Answer 1

Without knowing the exact page you are looking at, I tried the first Google News page.

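In rvest, html_node (without the s) always returns one result per input node, even if that result is NA. html_nodes, by contrast, returns only the matches it finds, so an article with no time element simply drops out of the vector. To keep the vectors the same length, find the common parent node of all the desired data nodes, then parse the desired information out of each of those parent nodes.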
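To make the difference concrete, here is a minimal sketch on a made-up HTML fragment rather than the live Google News page. The class names `.article`, `.title` and `.time` are hypothetical and chosen only for illustration; the second article deliberately has no time element.

```r
library(rvest)

# Hypothetical markup: three articles, the second one has no time element
page <- read_html('
  <div class="article"><h3 class="title">A</h3><span class="time">1 hour ago</span></div>
  <div class="article"><h3 class="title">B</h3></div>
  <div class="article"><h3 class="title">C</h3><span class="time">2 days ago</span></div>
')

# html_nodes() on the whole page only returns the time elements that exist,
# so this vector is shorter than the vector of titles
all_times <- page %>% html_nodes(".time") %>% html_text()

# Select one parent node per article first, then use html_node() inside each:
# it yields exactly one result per parent, NA where the child is missing
articles <- page %>% html_nodes(".article")
titles <- articles %>% html_node(".title") %>% html_text()
times <- articles %>% html_node(".time") %>% html_text()

# Both columns now have one entry per article, so data.frame() succeeds
result <- data.frame(Title = titles, Time = times)
```

Here `all_times` has fewer entries than `titles`, which is the situation that triggers the row-count error, while `times` lines up with `titles` and carries an NA for the article without a time.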

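Assuming the Title nodes are the most complete, I went up 1 level with xml_parent() and tried to retrieve the same number of description nodes; that did not work. I then went up 2 levels with xml_parent() %>% xml_parent(), which seems to work.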
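The trial and error can be checked by counting how many descriptions each ancestor level recovers. The sketch below uses a made-up fragment in which the title sits inside a wrapper, so the description is only reachable from the grandparent; the class names `.card`, `.head`, `.title` and `.desc` are hypothetical, and the real Google News markup may be nested differently.

```r
library(rvest)
library(xml2)  # xml_parent()

# Hypothetical markup: the title is wrapped in .head, the description is a
# sibling of that wrapper, and the second card has no description
page <- read_html('
  <div class="card"><div class="head"><h3 class="title">A</h3></div><p class="desc">about A</p></div>
  <div class="card"><div class="head"><h3 class="title">B</h3></div></div>
  <div class="card"><div class="head"><h3 class="title">C</h3></div><p class="desc">about C</p></div>
')

title_nodes <- page %>% html_nodes(".title")

# One level up lands on the .head wrapper, which contains no description
parents <- title_nodes %>% xml_parent()
parent_desc <- parents %>% html_node(".desc") %>% html_text()

# Two levels up lands on the .card, which holds both title and description
grandparents <- title_nodes %>% xml_parent() %>% xml_parent()
desc <- grandparents %>% html_node(".desc") %>% html_text()

# Number of descriptions found at each level
found_at_parent <- sum(!is.na(parent_desc))
found_at_grandparent <- sum(!is.na(desc))
```

A level is a plausible common parent when `desc` has the same length as `title_nodes` and most of its entries are non-missing; a level where every entry is NA is too low in the tree.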

library(rvest)
library(xml2)  # for xml_parent(); attached explicitly in case rvest does not attach xml2

url <- "https://news.google.com/topstories?hl=en-US&gl=US&ceid=US:en"
html_dat <- read_html(url)

Title <- html_dat %>% html_nodes('.DY5T1d') %>% html_text()

# Link = dat$Link
Link <- html_dat %>% html_nodes('.VDXfz') %>% html_attr('href')
# fixed = TRUE matches the leading "." literally instead of as a regex wildcard
Link <- gsub("./articles/", "https://news.google.com/articles/", Link, fixed = TRUE)

# Find the common parent node
# (this was trial and error: tried the parent, then the grandparent)
Titlenodes <- html_dat %>% html_nodes('.DY5T1d') %>% xml_parent() %>% xml_parent()
# html_node() returns one result per parent node, NA where the element is missing
Description <- Titlenodes %>% html_node('.Rai5ob') %>% html_text()
Time <- Titlenodes %>% html_node('.WW6dff') %>% html_text()

answer <- data.frame(Title, Time, Description, Link)

r

web-scraping

google-news

rvest