Xpath expression in R giving different result than the chrome inspector

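Using the XPath given below to get the date content from various pages, I get the desired results. But this page specifically, "http://eventsgeneva.strikingly.com//blog/agenda-geneve-something-you-should-never-miss", gives the expected result in the Chrome inspector, whereas the same XPath in R returns nothing.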


Using the XPath below in Chrome:
xpath = '((//h1/parent::*/following::*|//h1/ancestor::*[position()<3]/descendant-or-self::*)[position()<150 and (string-length(text())<150 and (contains(text(), "Jan") or contains(text(), "Feb") or contains(text(), "Mar") or contains(text(), "Apr") or contains(text(), "May") or contains(text(), "Jun") or contains(text(), "Jul") or contains(text(), "Aug") or contains(text(), "Sep") or contains(text(), "Oct") or contains(text(), "Nov") or contains(text(), "Dec")))])'  

I get the expected result, while using the same XPath in R with the "xml2" library I get an empty nodeset:

library(dplyr)

library(xml2)

html_page<-read_html("http://eventsgeneva.strikingly.com//blog/agenda-geneve-something-you-should-never-miss")

html_page%>%
  xml_find_all(xpath = '((//h1/parent::*/following::*|//h1/ancestor::*[position()<3]/descendant-or-self::*)[position()<150 and (string-length(text())<150 and (contains(text(), "Jan") or contains(text(), "Feb") or contains(text(), "Mar") or contains(text(), "Apr") or contains(text(), "May") or contains(text(), "Jun") or contains(text(), "Jul") or contains(text(), "Aug") or contains(text(), "Sep") or contains(text(), "Oct") or contains(text(), "Nov") or contains(text(), "Dec")))])')
#> {xml_nodeset (0)}

Am I missing something?

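What follows from the above: the page builds its content with JavaScript after it loads, so read_html() only sees the unrendered source, while Chrome's inspector shows the rendered DOM. The options below either render the page first or read the data straight from the embedded script.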
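
A minimal offline sketch of that difference, assuming the date only exists inside a script until a browser renders it. Both HTML strings and the shortened XPath are hypothetical stand-ins, not the real page source:

```r
library(xml2)

# Hypothetical, heavily simplified stand-in for what read_html() downloads:
# the date lives inside a <script>, not in any visible element
raw_source <- '<html><head><script>window.$S = {"date": "August 4, 2018"};</script></head>
<body><h1>Title</h1><div id="app"></div></body></html>'

# Hypothetical stand-in for the same page after a browser has run the script
rendered <- '<html><body><h1>Title</h1><div id="app"><span class="s-blog-date">August 4, 2018</span></div></body></html>'

# Shortened version of the month test from the question
xp <- '//span[contains(text(), "Aug")]'

n_raw      <- length(xml_find_all(read_html(raw_source), xp))
n_rendered <- length(xml_find_all(read_html(rendered), xp))

n_raw       # 0: no matching element in the unrendered source
n_rendered  # 1: the element exists once the page is rendered
```

If the real page behaves the same way, the XPath itself is fine and the missing piece is the rendering step.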

Using decapitated

library(rvest)
library(decapitated)
library(tidyverse)

doc <- decapitated::chrome_read_html("http://eventsgeneva.strikingly.com//blog/agenda-geneve-something-you-should-never-miss")

html_nodes(doc, xpath = '((//h1/parent::*/following::*|//h1/ancestor::*[position()<3]/descendant-or-self::*)[position()<150 and (string-length(text())<150 and (contains(text(), "Jan") or contains(text(), "Feb") or contains(text(), "Mar") or contains(text(), "Apr") or contains(text(), "May") or contains(text(), "Jun") or contains(text(), "Jul") or contains(text(), "Aug") or contains(text(), "Sep") or contains(text(), "Oct") or contains(text(), "Nov") or contains(text(), "Dec")))])')
## {xml_nodeset (1)}
## [1] <span class="s-blog-date">August 4, 2018</span>

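Please read the README and package documentation as needed for the Chrome setup (preferably a separate Chromium binary, as explained in the package) and the environment variable setup. You will have to debug any setup issues yourself.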

Using splashr

The splashr package requires the reticulate package, Docker, and the Python docker module. If you run into problems, expect to do more self-debugging:

library(rvest)
library(splashr)
library(tidyverse)

sp <- splashr::start_splash()

doc <- render_html(splash_local, "http://eventsgeneva.strikingly.com//blog/agenda-geneve-something-you-should-never-miss")

html_nodes(doc, xpath = '((//h1/parent::*/following::*|//h1/ancestor::*[position()<3]/descendant-or-self::*)[position()<150 and (string-length(text())<150 and (contains(text(), "Jan") or contains(text(), "Feb") or contains(text(), "Mar") or contains(text(), "Apr") or contains(text(), "May") or contains(text(), "Jun") or contains(text(), "Jul") or contains(text(), "Aug") or contains(text(), "Sep") or contains(text(), "Oct") or contains(text(), "Nov") or contains(text(), "Dec")))])')
## {xml_nodeset (1)}
## [1] <span class="s-blog-date">August 4, 2018</span>

killall_splash()

Using V8

To avoid using external programs, you can use V8 to process the page variable and get the content:

library(rvest)
library(V8)
library(tidyverse)

ctx <- v8()

doc <- read_html("http://eventsgeneva.strikingly.com//blog/agenda-geneve-something-you-should-never-miss")

html_nodes(doc, xpath=".//script")[[1]] %>% # get 1st <script>
  html_text() %>% # get contents of it
  str_replace(regex("^.*window\\.", multiline=TRUE), "var $S = {};\n") %>% # make the variable usable in V8
  ctx$eval() # evaluate the javascript
## [1] "[object Object]"

pg <- ctx$get("$S") # marshall it to R
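
The same steps can be tried offline on a toy script, which makes it easier to see what the substitution does. This is only a sketch: the `js` string is a hypothetical, much smaller stand-in for the page's first script, and it uses base `sub()` instead of `str_replace()` to keep dependencies down.

```r
library(V8)

# Hypothetical script body, shaped like an assignment to window.$S
js <- 'window.$S={"blogPostData":{"content":{"type":"Blog.BlogData"}}};'

# A bare V8 context is not a browser, so `window` may not be defined there.
# Dropping the "window." prefix and declaring $S first turns the line into
# a plain assignment to a variable that V8 can evaluate.
js <- sub("^.*window\\.", "var $S = {};\n", js)

ctx <- v8()
ctx$eval(js)             # run the rewritten JavaScript
toy <- ctx$get("$S")     # convert the JavaScript object to an R list

toy$blogPostData$content$type
```

The result is an ordinary nested R list, so `str()` and `$` work on it as in the rest of this section.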

pg is a big structure, so inspect it methodically:

str(pg, 1)
## List of 6
##  $ globalConf        :List of 26
##  $ conf              :List of 12
##  $ miniProgramAppType: NULL
##  $ blogPostData      :List of 5
##  $ siteData          :List of 5
##  $ stores            :List of 3

str(pg$blogPostData, 1)
## List of 5
##  $ blogPostMeta:List of 25
##  $ pageMeta    :List of 33
##  $ content     :List of 8
##  $ settings    :List of 2
##  $ pageMode    : NULL

str(pg$blogPostData$content, 1)
## List of 8
##  $ type            : chr "Blog.BlogData"
##  $ id              : chr "f_cc4ace2d-21ed-4b94-83a0-e83497e5afc4"
##  $ defaultValue    : NULL
##  $ showComments    : logi TRUE
##  $ showShareButtons: NULL
##  $ header          :List of 6
##  $ footer          :List of 5
##  $ sections        :'data.frame':    9 obs. of  4 variables:

The content appears to be here:

str(pg$blogPostData$content$sections)
## 'data.frame':    9 obs. of  4 variables:
##  $ type        : chr  "Blog.Section" "Blog.Section" "Blog.Section" "Blog.Section" ...
##  $ id          : chr  "f_9ca5a1d7-ccb8-4315-9883-bcd43d271b9c" "f_4b7b30f1-387c-4cbe-aaed-ddaedea92cc1" "f_252813ac-b6cb-484b-81f5-64d7f0745c8e" "f_bd7412a4-b94b-4c5a-8cdd-a48931639dce" ...
##  $ defaultValue: logi  NA NA NA NA NA NA ...
##  $ component   :'data.frame':    9 obs. of  6 variables:
##   ..$ type        : chr  "RichText" "RichText" "RichText" "RichText" ...
##   ..$ id          : chr  "f_4e41d6f3-8449-4f66-b701-28d1bcfb08c9" "f_c27703de-8679-4916-9697-220cb8c7a74d" "f_c3c20474-99fc-434a-aff1-102d2a342450" "f_7b3e5247-39ef-42c7-b95c-f0be0b6e9728" ...
##   ..$ defaultValue: logi  FALSE NA NA NA NA NA ...
##   ..$ value       : chr  "<p style=\"text-align: justify;\">We all make our plans beforehand in order to avoid any unnecessary issues. So"| __truncated__ "<p style=\"text-align: justify;\">Take a glance at the below-listed events and plan accordingly -</p>" "<p style=\"text-align: justify;\"><u>Siestes dominicales</u> – Here you are invited to groove on the grass and "| __truncated__ "<p style=\"text-align: justify;\"><u>Sonoboat ACT</u> – Neptune is one the most popular and historic sailing bo"| __truncated__ ...
##   ..$ backupValue : logi  NA NA NA NA NA NA ...
##   ..$ version     : int  1 NA NA NA NA NA NA 1 1

Either evaluate each value individually, or paste0() them into one HTML block and evaluate that.
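
A minimal sketch of the paste0() route. The `values` vector is a hypothetical stand-in for pg$blogPostData$content$sections$component$value, and the month filter is a simplified version of the test in the question's XPath:

```r
library(xml2)

# Hypothetical stand-in for pg$blogPostData$content$sections$component$value
values <- c(
  '<p style="text-align: justify;">Take a glance at the below-listed events</p>',
  '<p style="text-align: justify;"><u>Example event</u> - Aug 5, 2018</p>'
)

# Join the fragments and wrap them in one root element so they parse as a single document
block <- read_html(paste0("<div>", paste0(values, collapse = ""), "</div>"))

# Pull out the paragraphs and keep the ones that mention a month abbreviation
paras <- xml_find_all(block, ".//p")
texts <- xml_text(paras)
dated <- texts[grepl(paste(month.abb, collapse = "|"), texts)]

dated
```

Evaluating each value individually would mean calling read_html() once per element, for example with lapply(), which keeps the section boundaries at the cost of more parsing calls.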

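As an aside, Strikingly has one of the silliest, lowest-content-integrity/safety publishing solutions I've seen in a while. I know you're only scraping it, but I'd advise anyone considering using them not to.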