Webscraping with rvest, error "no applicable method for 'xml_find_first' applied to an object of class "character""
I am trying to scrape job titles from a web page with the rvest package, but I get this error:
Error in UseMethod("xml_find_first") :
no applicable method for 'xml_find_first' applied to an object of class "character"
Any suggestions? Am I missing part of the code? My code is below:
library(dplyr)
library(rvest)
library(stringr)
url <- "https://www.cvmarket.lt/darbo-skelbimai"
# save the url
html <- read_html(url) # read the url
get_links <- function(html) {
  html %>%
    html_nodes(xpath = '//*[contains(concat( " ", @class, " " ), concat( " ", "limited-lines", " " ))]') %>%
    html_attr(name = "href")
}
# now we call the function and save it
links <- get_links(html)
links
links <- paste0("https://www.cvmarket.lt", links)
link <- links[1]
html <- read_html(link)
# position title
get_title <- function(html) {
  html %>%
    html_node(xpath = '//*[(@id = "main-job-title")]') %>%
    html_text() %>%
    unlist()
}
#test
get_title(link)
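I would change your functions to take a URL as the input parameter and call read_html inside them. That also fixes the current error: get_title(link) passes a character string (the URL) rather than a parsed HTML document, so html_node has nothing it can search. I would also switch to CSS selectors, which are faster, and make them more specific: the id selector gives a quick match, and the class combination avoids duplicate links. Finally, url_absolute can complete the URLs inside get_links.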
library(dplyr)
library(rvest)
library(stringr)
get_links <- function(url) {
  read_html(url) %>%
    html_nodes('.main-column > .f_job_title') %>%
    html_attr(name = "href") %>%
    url_absolute(url)
}
# position title
get_title <- function(link) {
  read_html(link) %>%
    html_node('#main-job-title') %>%
    html_text()
}
url <- "https://www.cvmarket.lt/darbo-skelbimai"
links <- get_links(url)
link <- links[1]
#test
get_title(link)
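To see why the original call failed without hitting the site, here is a minimal offline sketch. The HTML string and the relative path "/jobs/123" are made-up stand-ins rather than content from cvmarket.lt; only the id main-job-title and the base URL come from the code above.

```r
library(rvest)

# Hypothetical stand-in for a job page, so the example runs without network access.
page_source <- '<html><body><h1 id="main-job-title">Data Analyst</h1></body></html>'

# Passing the raw character string fails, just as get_title(link) did in the question:
# html_node() cannot search a character vector.
err <- tryCatch(
  html_node(page_source, "#main-job-title"),
  error = function(e) conditionMessage(e)
)

# Parsing first yields an xml_document, which html_node() accepts.
title <- read_html(page_source) %>%
  html_node("#main-job-title") %>%
  html_text()

# url_absolute() resolves a relative href against the page URL.
# "/jobs/123" is an invented path used only for illustration.
full_link <- url_absolute("/jobs/123", "https://www.cvmarket.lt/darbo-skelbimai")

print(err)
print(title)
print(full_link)
```

The same pattern applies to the real pages: anything handed to html_node or html_nodes should already have gone through read_html.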