Next page by PhantomJS

I want to scrape all the link hrefs from https://www.vietnamworks.com/job-search/all-jobs.

I found that the site uses JavaScript to render its content, so I used phantomjs from R to scrape it, but I can only scrape page 1.

How can I click through to the next page and scrape all the remaining links?

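Assuming the data is what you want... there is another way to do this. If you right-click the page in Chrome, choose Inspect, and watch the network calls, you can find the API call the site makes to retrieve the data itself. Each call returns 50 results, and there appear to be at most 5000 results in total, so the maximum page value passed to the function was around 96 when I tested...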

.job_api <- function(page = 0){
  library(stringi)
  library(httr)
  # Algolia search endpoint that the site queries for its job listings
  url <- "https://jf8q26wwud-dsn.algolia.net/1/indexes/*/queries?"
  # They put request headers into their query string directly
  string_heads <- c(
    "x-algolia-agent=Algolia%20for%20vanilla%20JavaScript%20(lite)%203.24.5%3Binstantsearch.js%201.6.0%3BJS%20Helper%202.21.2",
    "x-algolia-application-id=JF8Q26WWUD",
    "x-algolia-api-key=M2UzZmI1Zjc1NGMwZmYzZjJiNWE0ZTgxMzNjNmIzMjc2ODEyZWQwZTJmYzNjMDhjNmU3NGQ3ZGViMzJiZTlkNHRhZ0ZpbHRlcnM9JnVzZXJUb2tlbj00ODBiNjRhNzI2NjQ3ODgwMThmNDhjZWNkYmVhNGVlYg%3D%3D"
  )

  api_url <- stri_join(c(url, stri_join(string_heads, collapse = "&")), collapse = "")

  # form body data
  body_part <- '{"requests":[{"indexName":"vnw_job_v2","params":"query=&hitsPerPage=50&maxValuesPerFacet=20&page=0&restrictSearchableAttributes=%5B%22jobTitle%22%2C%22skills%22%2C%22company%22%5D&facets=%5B%22categoryIds%22%2C%22locationIds%22%2C%22categories%22%2C%22locations%22%2C%22skills%22%2C%22jobLevel%22%2C%22company%22%5D&tagFilters="}]}'
  # swap the requested page number into the body with a regex.. this is ugly but quick
  body_post <- stri_replace_all_regex(body_part, "(?<=page=)[0-9]+", as.character(page))

  # Make the api call
  resp <- POST(api_url, body = body_post)
  # on success return the parsed data, otherwise return the raw response for inspection
  if (status_code(resp) == 200L) {
    content(resp)
  } else {
    resp
  }

}

Here is what some of the output looks like.

> test <- .job_api(0)
> length(test$results[[1]]$hits)
[1] 50
> names(test$results[[1]]$hits[[50]]$`_highlightResult`)
[1] "jobTitle"       "skills"         "company"        "jobDescription" "jobRequirement"
> test$results[[1]]$hits[[5]]$`_highlightResult`$skills[[1]]
$value
[1] "Process System Engineering"

$matchLevel
[1] "none"

$matchedWords
list()
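
To get the rest of the pages, one option is to call the function repeatedly with an increasing page number until a page comes back with no hits. The following is only a sketch under that assumption: `collect_hits`, `max_pages` and `pause` are hypothetical names that are not part of the original answer, and the only thing taken from the output above is that hits live in `results[[1]]$hits`. The field that holds each job's link is not visible in that output, so inspect `names(all_hits[[1]])` before extracting anything.

```r
# Hypothetical helper: walk pages 0, 1, 2, ... and gather every hit.
# `fetch_page` is any function that takes a page number and returns a list
# shaped like the output above (e.g. .job_api).
collect_hits <- function(fetch_page, max_pages = 100L, pause = 0) {
  all_hits <- list()
  for (page in seq_len(max_pages) - 1L) {   # the API pages start at 0
    res <- fetch_page(page)
    # .job_api returns the raw response on failure, which has no `results`
    if (is.null(res$results)) {
      stop("request for page ", page, " did not return results")
    }
    hits <- res$results[[1]]$hits
    if (length(hits) == 0L) break           # past the last page
    all_hits <- c(all_hits, hits)
    if (pause > 0) Sys.sleep(pause)         # optional delay between calls
  }
  all_hits
}

# Usage (makes network calls, so it is left commented out):
# all_hits <- collect_hits(.job_api, pause = 1)
# names(all_hits[[1]])   # look for the field that holds the job link
```

Passing the fetch function in as an argument keeps the paging loop separate from the HTTP call, so the loop can be tried against a stub before hitting the real API.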