Error: 'NA' does not exist in current working directory (Webscraping)
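I am trying to scrape data from the following URL:
https://university.careers360.com/colleges/list-of-degree-colleges-in-India
I want to click on each college's name and get specific data for each college.
The first thing I did was collect all the college URLs in a vector: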
# Loading the packages
library(xml2)
library(rvest)
library(stringr)
library(dplyr)
# Specifying the URL of the website to be scraped
baseurl <- "https://university.careers360.com/colleges/list-of-degree-colleges-in-India"
# Reading the HTML content of the listing page
basewebpage <- read_html(baseurl)
# Extracting each college name and its URL
scraplinks <- function(url){
  # Create an HTML document from the URL
  webpage <- xml2::read_html(url)
  # Extract the URLs
  url_ <- webpage %>%
    rvest::html_nodes(".title a") %>%
    rvest::html_attr("href")
  # Extract the link text
  link_ <- webpage %>%
    rvest::html_nodes(".title a") %>%
    rvest::html_text()
  return(data_frame(link = link_, url = url_))
}
# College names and URLs
allcollegeurls <- scraplinks(baseurl)
This works fine so far, but when I use read_html on each URL, it shows an error.
# Reading each URL
for (i in allcollegeurls$url) {
  clgwebpage <- read_html(allcollegeurls$url[i])
}
Error: 'NA' does not exist in current working directory ('C:/Users/User/Documents').
I even tried a 'break' statement, but I still get the same error:
# Reading each URL
for (i in allcollegeurls$url) {
  clgwebpage <- read_html(allcollegeurls$url[i])
  if(is.na(allcollegeurls$url[i]))break
}
Please help.
Posting the str of allcollegeurls as requested:
> str(allcollegeurls)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 30 obs. of 2 variables:
 $ link: chr "Netaji Subhas Institute of Technology, Delhi" "Hansraj College, Delhi" "School of Business, University of Petroleum and Energy Studies, D.." "Hindu College, Delhi" ...
 $ url : chr "https://www.careers360.com/university/netaji-subhas-university-of-technology-new-delhi" "https://www.careers360.com/colleges/hansraj-college-delhi" "https://www.careers360.com/colleges/school-of-business-university-of-petroleum-and-energy-studies-dehradun" "https://www.careers360.com/colleges/hindu-college-delhi" ...
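This works:
purrr::map(allcollegeurls$url, read_html)
The map function transforms its input by applying a function to each element and returning a list of the same length as the input. I prefer to avoid for loops in R.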
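For anyone wondering why the original loop fails: `for (i in allcollegeurls$url)` puts the URL string itself into `i`, so `allcollegeurls$url[i]` indexes the vector by a name that does not exist and returns NA. read_html then treats "NA" as a local file path, which gives the error in the title. The `break` check does not help because it runs after read_html and uses the same bad index. Here is a small sketch of the difference; the `urls` vector and example.com addresses are made-up stand-ins so it runs without network access.

```r
# Stand-in URLs (hypothetical), so the sketch runs without network access
urls <- c("https://example.com/a", "https://example.com/b")

# Original pattern: i is the URL string itself, so urls[i] looks up an
# element *named* by that string; the vector has no names, so the result is NA
wrong <- character(0)
for (i in urls) {
  wrong <- c(wrong, urls[i])
}

# Fix 1: iterate over the values and use them directly
by_value <- character(0)
for (u in urls) {
  by_value <- c(by_value, u)   # in the real script: read_html(u)
}

# Fix 2: iterate over positions and index with them
by_index <- character(length(urls))
for (i in seq_along(urls)) {
  by_index[i] <- urls[i]       # in the real script: read_html(urls[i])
}
```

Either fix should make the for loop behave like the purrr::map call above, which also passes each URL value directly to read_html.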
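I ran into almost the same problem with my data today.
Please remove any NA values from the url column.
In my case, the error was
Error: ' ' does not exist in current working directory.
I removed the blank entries from the column the function was applied to, and it worked.
The error above means the input contains an NA, and the function cannot be applied to it.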
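A minimal sketch of that cleanup, assuming the URLs sit in a plain character vector. The `urls` and `clean_urls` names and the example.com addresses are made up for illustration.

```r
# Hypothetical URL column containing a missing value and a blank entry
urls <- c("https://example.com/a", NA, " ", "https://example.com/b")

# Keep entries that are neither NA nor empty after trimming whitespace
clean_urls <- urls[!is.na(urls) & nzchar(trimws(urls))]

# In the real script: pages <- lapply(clean_urls, xml2::read_html)
```

The `!is.na()` test is needed on its own because `nzchar(NA)` returns TRUE by default, so the blank check alone would let NA through.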