Webscraping with rvest, error "no applicable method for 'xml_find_first' applied to an object of class "character""

Question

I am trying to scrape the position title from a web page using the rvest package, but I get this error:

Error in UseMethod("xml_find_first") : 
  no applicable method for 'xml_find_first' applied to an object of class "character"

Any suggestions? Am I missing part of the code? My code is below:

library(dplyr)
library(rvest)
library(stringr)

url <- "https://www.cvmarket.lt/darbo-skelbimai"
# save the url
html <- read_html(url) # read the url 

get_links <- function(html) {
  html %>%
    html_nodes(xpath = '//*[contains(concat( " ", @class, " " ), concat( " ", "limited-lines", " " ))]') %>%
    html_attr(name = "href")
}
# now we call the function and save it
links <- get_links(html)
links

links <- paste0("https://www.cvmarket.lt", links)

link <- links[1]
html <- read_html(link)


# position title
get_title <- function(html) {
  html %>%
    html_node(xpath = '//*[(@id = "main-job-title")]') %>%
    html_text() %>%
    unlist()
}
#test
get_title(link)

Answer 1

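I would change your functions to accept a URL as the input argument. Use CSS selectors, which are faster than XPath, and make them more specific: selecting by id gives a faster match, and the CSS class combination avoids duplicate links. You can use url_absolute inside get_links to complete the URLs. This also fixes the current error, which occurs because you pass a URL (a character string) to get_title instead of a parsed HTML document; the revised get_title takes the link and calls read_html on it before selecting the node.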
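As a small illustration of the url_absolute step, here is a minimal sketch of how a relative href is resolved against the page URL. The path "/job/123" is a made-up example, not a real link from the site, and the call is written as xml2::url_absolute so that it does not depend on which functions rvest re-exports.

```r
library(xml2)

# Page the links were scraped from (same URL as in the question).
base_url <- "https://www.cvmarket.lt/darbo-skelbimai"

# Hypothetical relative href, as html_attr(name = "href") might return it.
relative_link <- "/job/123"

# Resolve the relative link against the page URL instead of using paste0().
full_link <- url_absolute(relative_link, base_url)
print(full_link)
```

Compared with paste0("https://www.cvmarket.lt", links), this should also leave hrefs that are already absolute usable, since they are resolved rather than blindly prefixed.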

library(dplyr)
library(rvest)
library(stringr)

get_links <- function(url) {
  read_html(url) %>%
    html_nodes('.main-column > .f_job_title') %>%
    html_attr(name = "href") %>% url_absolute(url)
}

# position title
get_title <- function(link) {
  read_html(link) %>%
    html_node('#main-job-title') %>%
    html_text() 
}


url <- "https://www.cvmarket.lt/darbo-skelbimai"
links <- get_links(url)
link <- links[1]

#test
get_title(link)
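To see the cause of the error in isolation, here is a minimal offline sketch. The inline HTML string and the example.com URL are stand-ins I made up for a job page, and get_title_from_doc is a hypothetical name for the original version of the function that expects a parsed document. The exact wording of the error may vary with the installed rvest and xml2 versions.

```r
library(rvest)

# Made-up page source standing in for a job posting (not the real site).
page_source <- '<html><body><h1 id="main-job-title">Data Analyst</h1></body></html>'

# Same shape as the question's get_title: it expects a parsed document.
get_title_from_doc <- function(html) {
  html %>%
    html_node("#main-job-title") %>%
    html_text()
}

# Passing a plain character string (such as a URL) fails, because
# html_node() has nothing to search in; capture the message instead of stopping.
result <- tryCatch(
  get_title_from_doc("https://example.com/job/1"),
  error = function(e) conditionMessage(e)
)
print(result)

# Parsing first gives html_node() a document to work with.
doc <- read_html(page_source)
title <- get_title_from_doc(doc)
print(title)
```

The functions in the answer avoid the problem by calling read_html inside get_title, so the caller only ever passes a link.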

r

web-scraping

rvest