Need help extracting the first Google search result using html_node in R

Question

I have a list of hospital names, and for each one I need to extract the URL of the first Google search result. This is the code I am using:

library(rvest)
library(urltools)
library(RCurl)
library(httr)

# Return the text of the first <cite> element on the Google results page for `name`
getWebsite <- function(name) {
  url <- URLencode(paste0("https://www.google.com/search?q=", name))

  page <- read_html(url)

  results <- page %>%
    html_nodes("cite") %>%
    html_text()

  as.character(results[1])
}

# `c` is the character vector of hospital names
websites <- data.frame(Website = sapply(c, getWebsite))
View(websites)

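This code works fine for short URLs. But when the link is long, it is shown in R with "..." in it (for example, www.medicine.northwestern.edu/divisions/allergy-immunology/.../fellowship.html), and it ends up in the data frame the same way, with the "..." included. How can I extract the actual URLs without the "..."? Thanks for your help!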

Answer 1

This is a working example, tested on my computer:

library("rvest")
# Load the page
main.page <- read_html(x = "https://www.google.com/search?q=software%20programming")
links <- main.page %>% 
  html_nodes(".r a") %>% # get the a nodes with an r class
  html_attr("href") # get the href attributes
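# Clean the text: keep only the "/url?q=..." links, cut each at the first "&",
# then strip the "/url?q=" prefix (the "?" must be escaped as "\\?" in an R string)
links <- gsub('/url\\?q=', '', sapply(strsplit(links[grep('url', links)], split = '&'), '[', 1))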
# as a dataframe
websites <- data.frame(links = links, stringsAsFactors = FALSE)
View(websites)
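
The reason this avoids the "..." is that it reads the href attribute of each link rather than the visible text of the cite element, which is the shortened form shown on the page. The cleaning step can be tried offline with the minimal sketch below. The href values are made up for illustration and only assume the "/url?q=<target>&<other parameters>" shape that the gsub() call above expects. extract_target() is a hypothetical helper, not an rvest function, and the markup Google actually returns may differ.

```r
# Made-up hrefs in the "/url?q=<target>&<params>" shape (an assumption, not real Google output)
hrefs <- c(
  "/url?q=https://www.example.org/divisions/allergy/fellowship.html&sa=U&ved=abc",
  "/search?q=related+query",
  "/url?q=https://example.com/page%3Fid%3D1&sa=U"
)

# Hypothetical helper: pull the target URL out of each redirect-style href
extract_target <- function(hrefs) {
  # Keep only links that start with the "/url?q=" redirect prefix
  hits <- hrefs[startsWith(hrefs, "/url?q=")]
  # Strip the prefix, then drop everything from the first "&" onward
  targets <- sub("&.*$", "", sub("^/url\\?q=", "", hits))
  # Undo percent-encoding of the embedded URL, one element at a time
  vapply(targets, utils::URLdecode, character(1), USE.NAMES = FALSE)
}

# The first element corresponds to the first search result
first_result <- extract_target(hrefs)[1]
print(first_result)
```

To get one URL per hospital, the same idea could be wrapped in the getWebsite function from the question, returning the first element instead of the whole vector.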

Tags: r, extract, hyperlink, google-search