Need help extracting the first Google search result using html_node in R
I have a list of hospital names, and I need to extract the first Google search result URL for each one. This is the code I am using:
library(rvest)
library(urltools)
library(RCurl)
library(httr)
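getWebsite <- function(name) {
  # Build the search URL for one hospital name
  url <- URLencode(paste0("https://www.google.com/search?q=", name))
  page <- read_html(url)
  # The <cite> nodes hold the URL text displayed under each result
  results <- page %>%
    html_nodes("cite") %>%
    html_text()
  # Keep only the first result
  result <- results[1]
  return(as.character(result))
}
# c is the vector of hospital names
websites <- data.frame(Website = sapply(c, getWebsite))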
View(websites)
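This code works fine for short URLs, but when a link is long and shows up in R with "..." (for example www.medicine.northwestern.edu/divisions/allergy-immunology/.../fellowship.html), it appears in the data frame with the "..." as well. How can I extract the actual URLs without the "..."? Thanks for your help!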
This is a working example, tested on my computer:
library("rvest")
# Load the page
main.page <- read_html(x = "https://www.google.com/search?q=software%20programming")
links <- main.page %>%
html_nodes(".r a") %>% # get the a nodes with an r class
html_attr("href") # get the href attributes
# clean the text: keep the "/url?q=" links, cut at the first "&", strip the prefix
links <- gsub('/url\\?q=', '', sapply(strsplit(links[as.vector(grep('url', links))], split = '&'), '[', 1))
# as a dataframe
websites <- data.frame(links = links, stringsAsFactors = FALSE)
View(websites)
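The cleaning step can also be pulled out into a small helper, which makes it easier to check without hitting Google. This is only a sketch: `clean_google_href` and the example hrefs are hypothetical names and values made up for illustration, and it assumes the hrefs still have the "/url?q=<target>&..." shape that the code above relies on, which depends on Google's current markup.

```r
# Hypothetical helper (not part of the answer above): extract the target URL
# from Google redirect hrefs of the assumed form "/url?q=<target>&sa=...".
clean_google_href <- function(hrefs) {
  # Keep only the redirect-style links
  hrefs <- hrefs[grepl("^/url\\?q=", hrefs)]
  # Strip the "/url?q=" prefix
  targets <- sub("^/url\\?q=", "", hrefs)
  # Drop everything from the first "&" onwards (tracking parameters)
  targets <- sub("&.*$", "", targets)
  # Decode any percent-escapes in the remaining URL
  vapply(targets, utils::URLdecode, character(1), USE.NAMES = FALSE)
}

# Made-up hrefs, standing in for the output of html_attr("href")
example_hrefs <- c(
  "/search?q=related",
  "/url?q=https://www.example.org/divisions/allergy/fellowship.html&sa=U&ved=abc"
)

# First cleaned link, i.e. the first search result
first_url <- clean_google_href(example_hrefs)[1]
print(first_url)
```

Inside a function like getWebsite from the question, the same helper could be applied to the hrefs of each results page and the first element returned, so the full URL is kept instead of the shortened text shown in the cite nodes.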