Scrape data from a page that uses .JSF search using R

Question

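I am trying to scrape information from the Swiss Federal Administrative Court for a university research project.

The URL is: https://jurispub.admin.ch/publiws/pub/search.jsf I am interested in the data listed in the table that appears once a search has been run.

Unfortunately, there is no robots.txt file. However, all decisions on this website are open to the public.

I have some experience with HTML scraping, and I have looked at the following resources: http://www.rladiesnyc.org/post/scraping-javascript-websites-in-r/

https://www.r-bloggers.com/web-scraping-javascript-rendered-sites/

My approach

I think a good approach would be to use PhantomJS to download an HTML version of the page and then scrape the downloaded page.

My problem

However, I do not know how to get the URL of the page that appears when you run an "empty" search at https://jurispub.admin.ch/publiws/ (by clicking "suchen" without entering anything in the search mask), which gives 57,294 results. I thought of something like: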

GET(url = "https://jurispub.admin.ch/publiws/",
      query=list(searchQuery=""))

However, this does not work.

Also, I would not know how to have PhantomJS "click" on the little arrow button to load the next page.

Answer 1

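Adding external dependencies is fine, but it really should be a last resort (IMO).

If you're not familiar with the Developer Tools view in your browser, do some research on it before reading this answer. You need to have it open in a fresh browser session before you go to the search page to really see the flow.

GET didn't work because this is an HTML form and the <form> element uses a POST request (which shows up as an XHR request in the Network pane of most developer tools). However, this is a poorly crafted site that is too complicated for its own good (almost worse than a Microsoft SharePoint site): some initial state is set up when you start searching and is then maintained throughout the rest of the flow.

I used curlconverter to triage the POST XHR requests. The TL;DR is to right-click on any POST XHR request, find the "Copy as cURL" menu item and select it. Then, with that still on the clipboard, follow the instructions in the curlconverter README and manual pages to get actual httr functions. I really can't promise to walk you through that part or to answer curlconverter questions here.

Anyway, to get httr/curl to maintain some cookies for you, and to get a key session variable that you need to pass with every call, we need to start from a fresh R session and "prime" the scraping process with a GET to the main search URL: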

library(stringi) # I prefer this for extracting matched strings
library(xml2)    # xml_find_first(), xml_text(); newer rvest versions no longer attach it
library(rvest)
library(httr)

primer <- httr::GET("https://jurispub.admin.ch/publiws/pub/search.jsf")

Now, we need to extract the session string that is in the javascript on that page:

httr::content(primer, as="text") %>%
  stri_match_first_regex("session: '([[:alnum:]]+)'") %>% 
  .[,2] -> ice_session
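
If the pattern above does not match, ice_session ends up as NA and every later POST fails in a confusing way. Below is a minimal base-R sketch of the same extraction that stops early instead. extract_ice_session is a hypothetical helper name, and the sample string is made up for illustration, not copied from the site:

```r
# Hypothetical helper: same regex as above, but with an explicit failure.
extract_ice_session <- function(page_text) {
  m <- regmatches(page_text, regexec("session: '([[:alnum:]]+)'", page_text))[[1]]
  # m is character(0) when nothing matched; otherwise m[2] is the capture group
  if (length(m) < 2) stop("no session token found in page text")
  m[2]
}

# Made-up sample input, only to show the shape of the match
extract_ice_session("container.bridge = new Bridge({session: 'AbC123', view: 1});")
## [1] "AbC123"
```

With the real page you would call it as extract_ice_session(httr::content(primer, as = "text")).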

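Now, we pretend that we're submitting the form. Not all of these hidden variables may be needed, but this is what the browser sent. I usually try to pare them down to only what is needed, but this is your project, so have fun doing that if you want: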

httr::POST(
  url = "https://jurispub.admin.ch/publiws/block/send-receive-updates",
  body = list(
    `$ice.submit.partial` = "true",
    ice.event.target = "form:_id64",
    ice.event.captured = "form:_id63first",
    ice.event.type = "onclick",
    ice.event.alt = "false",
    ice.event.ctrl = "false",
    ice.event.shift = "false",
    ice.event.meta = "false",
    ice.event.x = "51", 
    ice.event.y = "336",
    ice.event.left = "true",
    ice.event.right = "false",
    form = "form", 
    icefacesCssUpdates = "",
    `form:_id63` = "first",
    `form:_idcl` = "form:_id63first",
    ice.session = ice_session,
    ice.view = "1", 
    ice.focus = "form:_id63first",
    rand = "0.38654987905551663\n\n"
  ),
  encode = "form"
) -> first_pg

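Now that we have the first page, we need to get the data out of it. I'm not going to solve this completely, but you should be able to extrapolate from what's below. The POST request returns XML that the javascript on the page turns into an ugly table. We're going to extract that table: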

httr::content(first_pg) %>% 
  xml_find_first("//updates/update/content") %>% 
  xml_text() %>% 
  read_html() -> pg_tbl

data_tbl <- html_node(pg_tbl, xpath=".//table[contains(., 'Dossiernummer')]")

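However, the HTML usage is terrible (the programmers haven't a clue how to do web content properly), and you can't just use html_table() on it (and you wouldn't want to anyway, since you likely want the links to the PDFs or whatever). So, we can pull out columns at will: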

html_nodes(data_tbl, xpath=".//td[1]/a") %>% 
  html_text()
## [1] "A-3930/2013" "D-7885/2009" "E-5869/2012" "C-651/2011"  "F-2439/2017" "D-7416/2009"
## [7] "D-838/2011"  "C-859/2011"  "E-1927/2017" "E-2606/2011"

html_nodes(data_tbl, xpath=".//td[2]/a") %>% 
  html_attr("href")
##  [1] "/publiws/download?decisionId=0002b1f8-ea53-40bb-8e38-402d9f3fdfa9"
##  [2] "/publiws/download?decisionId=0002da8f-306e-4395-8eed-0b168df8634b"
##  [3] "/publiws/download?decisionId=0003ec45-50be-45b2-8a56-5c0d866c2603"
##  [4] "/publiws/download?decisionId=000508c2-c852-4aef-bc32-3385ddbbe88a"
##  [5] "/publiws/download?decisionId=0006fbb9-228a-4bdc-ac8c-52db67df3b34"
##  [6] "/publiws/download?decisionId=0008a971-6795-434d-90d4-7aeb1961606b"
##  [7] "/publiws/download?decisionId=00099619-519c-4c8f-9cea-a16ed9ab9fd8"
##  [8] "/publiws/download?decisionId=0009ac38-f2b0-4733-b379-05682473b5d9"
##  [9] "/publiws/download?decisionId=000a4e0f-b2a2-483b-a49f-6ad12f4b7849"
## [10] "/publiws/download?decisionId=000be307-37b1-4d46-b651-223ceec9e533"

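Lather, rinse, repeat for any other column, but you may need to do some work to get them out as nicely, and that is an exercise left to you (i.e. I won't answer questions on it).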
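
The hrefs printed above are site-relative, so they need a host in front of them before they can be fetched. A small sketch, assuming the downloads are served from the same host as the search page (the answer does not state this); base_url, pdf_urls and decision_ids are names introduced here for illustration:

```r
# Assumption: downloads live on the same host as the search page.
base_url <- "https://jurispub.admin.ch"

# Two of the hrefs printed above, used as sample input
hrefs <- c(
  "/publiws/download?decisionId=0002b1f8-ea53-40bb-8e38-402d9f3fdfa9",
  "/publiws/download?decisionId=0002da8f-306e-4395-8eed-0b168df8634b"
)

# Absolute URLs that can be handed to httr::GET() later
pdf_urls <- paste0(base_url, hrefs)

# The decisionId part can double as a unique key or file name
decision_ids <- sub("^.*decisionId=", "", hrefs)
```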

And, you'll want to know where you are in the scraping process, so we need to grab that line at the bottom of the table:

html_node(pg_tbl, xpath=".//span[contains(@class, 'iceOutFrmt')]") %>% 
  html_text()
## [1] "57,294 Entscheide gefunden, zeige 1 bis 10. Seite 1 von 5,730. Resultat sortiert nach: Relevanz"

Parsing that into the number of results and which page you're on is an exercise left to the reader.
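
For what it's worth, one possible take on that exercise in base R is sketched below. It assumes the progress text always contains its five numbers in the order shown above (total, first row, last row, current page, page count); parse_progress is a hypothetical helper name:

```r
# Hypothetical helper: turn the progress line into a list of integers.
parse_progress <- function(txt) {
  # every run of digits, allowing thousands separators such as "57,294"
  nums <- regmatches(txt, gregexpr("[0-9][0-9,]*", txt))[[1]]
  nums <- as.integer(gsub(",", "", nums, fixed = TRUE))
  if (length(nums) < 5) stop("unexpected progress text: ", txt)
  list(total = nums[1], from = nums[2], to = nums[3],
       page = nums[4], pages = nums[5])
}

p <- parse_progress(
  "57,294 Entscheide gefunden, zeige 1 bis 10. Seite 1 von 5,730. Resultat sortiert nach: Relevanz"
)
p$pages
## [1] 5730
```

A loop can then stop once p$page equals p$pages.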

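Now, we need to programmatically click "next page" until we're done. I'm going to do two manual iterations to prove that it works, to head off "it doesn't work" comments. You should write an iterator or a loop to go through all the next pages and save the data however you want.

Next page (first iteration):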

httr::POST(
  url = "https://jurispub.admin.ch/publiws/block/send-receive-updates",
  body = list(
    `$ice.submit.partial` = "true",
    ice.event.target = "form:_id67",
    ice.event.captured = "form:_id63next",
    ice.event.type = "onclick",
    ice.event.alt = "false",
    ice.event.ctrl = "false",
    ice.event.shift = "false",
    ice.event.meta = "false",
    ice.event.x = "330", 
    ice.event.y = "559",
    ice.event.left = "true",
    ice.event.right = "false",
    form = "", 
    icefacesCssUpdates = "",
    `form:_id63` = "next",
    `form:_idcl` = "form:_id63next",
    iceTooltipInfo = "tooltip_id=form:resultTable:7:tt_ps; tooltip_src_id=form:resultTable:7:_id57; tooltip_state=hide; tooltip_x=846; tooltip_y=433; cntxValue=",
    ice.session =  ice_session,
    ice.view = "1", 
    ice.focus = "form:_id63next",
    rand = "0.17641832791084566\n\n"
  ),
  encode = "form"
) -> next_pg

httr::content(next_pg) %>% 
  xml_find_first("//updates/update/content") %>% 
  xml_text() %>% 
  read_html() -> pg_tbl

data_tbl <- html_node(pg_tbl, xpath=".//table[contains(., 'Dossiernummer')]")

html_nodes(data_tbl, xpath=".//td[1]/a") %>% 
  html_text()
##  [1] "D-4059/2011" "D-4389/2006" "E-4019/2006" "D-4291/2008" "E-5642/2012" "E-7752/2010"
##  [7] "D-7010/2014" "D-1551/2013" "C-7715/2010" "E-3187/2013"

html_nodes(data_tbl, xpath=".//td[2]/a") %>% 
  html_attr("href")
##  [1] "/publiws/download?decisionId=000bfd02-4da5-4bb2-a5d0-e9977bf8e464"
##  [2] "/publiws/download?decisionId=000e2be1-6da8-47ff-b707-4a3537320a82"
##  [3] "/publiws/download?decisionId=000fa961-ecb4-47d2-8ca3-72e8824c2c6b"
##  [4] "/publiws/download?decisionId=0010a089-4f19-433e-b106-6d75833fae9a"
##  [5] "/publiws/download?decisionId=00111bfc-3522-4a32-9e7a-fa2d9f171427"
##  [6] "/publiws/download?decisionId=00126b65-b345-4988-826b-b213080caa45"
##  [7] "/publiws/download?decisionId=00127944-5c88-43f6-9ef1-3c822288b0c7"
##  [8] "/publiws/download?decisionId=00135a17-f1eb-4b61-9171-ac1d27fd3910"
##  [9] "/publiws/download?decisionId=0014c6ea-c229-4129-bbe0-7411d34d9743"
## [10] "/publiws/download?decisionId=00167998-54d2-40a5-b02b-0c4546ac4760"

html_node(pg_tbl, xpath=".//span[contains(@class, 'iceOutFrmt')]") %>% 
  html_text()
## [1] "57,294 Entscheide gefunden, zeige 11 bis 20. Seite 2 von 5,730. Resultat sortiert nach: Relevanz"

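Note that the column values are different and the progress text is different too. Also note that we got lucky: the incompetent programmers on the site actually provided a "next" event rather than forcing us to figure out page numbers and X/Y coordinates.

Next page (second and last example iteration):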

httr::POST(
  url = "https://jurispub.admin.ch/publiws/block/send-receive-updates",
  body = list(
    `$ice.submit.partial` = "true",
    ice.event.target = "form:_id67",
    ice.event.captured = "form:_id63next",
    ice.event.type = "onclick",
    ice.event.alt = "false",
    ice.event.ctrl = "false",
    ice.event.shift = "false",
    ice.event.meta = "false",
    ice.event.x = "330", 
    ice.event.y = "559",
    ice.event.left = "true",
    ice.event.right = "false",
    form = "", 
    icefacesCssUpdates = "",
    `form:_id63` = "next",
    `form:_idcl` = "form:_id63next",
    iceTooltipInfo = "tooltip_id=form:resultTable:7:tt_ps; tooltip_src_id=form:resultTable:7:_id57; tooltip_state=hide; tooltip_x=846; tooltip_y=433; cntxValue=",
    ice.session =  ice_session,
    ice.view = "1", 
    ice.focus = "form:_id63next",
    rand = "0.17641832791084566\n\n"
  ),
  encode = "form"
) -> next_pg

httr::content(next_pg) %>% 
  xml_find_first("//updates/update/content") %>% 
  xml_text() %>% 
  read_html() -> pg_tbl

data_tbl <- html_node(pg_tbl, xpath=".//table[contains(., 'Dossiernummer')]")

html_nodes(data_tbl, xpath=".//td[1]/a") %>% 
  html_text()
##  [1] "D-3974/2010" "D-5847/2009" "D-4241/2015" "E-3043/2010" "D-602/2016"  "C-2065/2008"
##  [7] "D-2753/2007" "E-2446/2010" "C-1124/2015" "B-7400/2006"

html_nodes(data_tbl, xpath=".//td[2]/a") %>% 
  html_attr("href")
##  [1] "/publiws/download?decisionId=00173ef1-2900-49d4-b7d3-39246e552a70"
##  [2] "/publiws/download?decisionId=001a344c-86b7-4f32-97f7-94d30669a583"
##  [3] "/publiws/download?decisionId=001ae810-300d-4291-8fd0-35de720a6678"
##  [4] "/publiws/download?decisionId=001c2025-57dd-4bc6-8bd6-eedbd719a6e3"
##  [5] "/publiws/download?decisionId=001c44ba-e605-455d-9609-ed7dffb17adc"
##  [6] "/publiws/download?decisionId=001c6040-4b81-4137-a6ee-bad5a5019e71"
##  [7] "/publiws/download?decisionId=001d0811-a5c2-4856-aef3-51a44f7f2b0e"
##  [8] "/publiws/download?decisionId=001dbf61-b1b8-468d-936e-30b174a8bec9"
##  [9] "/publiws/download?decisionId=001ea85a-0765-4a1f-9b81-3cecb9f36b31"
## [10] "/publiws/download?decisionId=001f2e34-9718-4ef7-a60c-e6bbe208003b"

html_node(pg_tbl, xpath=".//span[contains(@class, 'iceOutFrmt')]") %>% 
  html_text()
## [1] "57,294 Entscheide gefunden, zeige 21 bis 30. Seite 3 von 5,730. Resultat sortiert nach: Relevanz"

Ideally, you'd wrap the POST in a function you can call and have it return data frames that you can rbind or bind_rows into one big data frame.
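
A rough sketch of what that wrapping could look like, reusing the request fields and XPath expressions from above. All function names here are hypothetical, it assumes the first-page POST has already been sent in the same R session, and it assumes both columns yield the same number of links per page (otherwise data.frame() will error):

```r
# Body of the "next page" click; values are copied from the requests above.
next_page_body <- function(ice_session) {
  list(
    `$ice.submit.partial` = "true",
    ice.event.target = "form:_id67", ice.event.captured = "form:_id63next",
    ice.event.type = "onclick", ice.event.alt = "false",
    ice.event.ctrl = "false", ice.event.shift = "false",
    ice.event.meta = "false", ice.event.x = "330", ice.event.y = "559",
    ice.event.left = "true", ice.event.right = "false",
    form = "", icefacesCssUpdates = "",
    `form:_id63` = "next", `form:_idcl` = "form:_id63next",
    ice.session = ice_session, ice.view = "1",
    ice.focus = "form:_id63next", rand = "0.17641832791084566\n\n"
  )
}

# Send one "next" click and fail loudly on HTTP errors.
fetch_next_page <- function(ice_session) {
  res <- httr::POST(
    url = "https://jurispub.admin.ch/publiws/block/send-receive-updates",
    body = next_page_body(ice_session),
    encode = "form"
  )
  httr::stop_for_status(res)
  res
}

# Turn one response into a two-column data frame.
page_to_df <- function(res) {
  node <- xml2::xml_find_first(httr::content(res), "//updates/update/content")
  pg <- xml2::read_html(xml2::xml_text(node))
  tbl <- rvest::html_node(pg, xpath = ".//table[contains(., 'Dossiernummer')]")
  data.frame(
    dossier = rvest::html_text(rvest::html_nodes(tbl, xpath = ".//td[1]/a")),
    pdf_href = rvest::html_attr(rvest::html_nodes(tbl, xpath = ".//td[2]/a"), "href"),
    stringsAsFactors = FALSE
  )
}

# Click "next" n_pages times, pausing between requests to be polite.
scrape_next_pages <- function(ice_session, n_pages, delay = 1) {
  pages <- vector("list", n_pages)
  for (i in seq_len(n_pages)) {
    pages[[i]] <- page_to_df(fetch_next_page(ice_session))
    Sys.sleep(delay)
  }
  do.call(rbind, pages)
}
```

Something like scrape_next_pages(ice_session, n_pages = 5) would then return rows for pages 2 through 6. The iceTooltipInfo field is left out here on the guess that it is not required; add it back if the server complains.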

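If you've made it this far, an alternative is to use RSelenium to orchestrate the page clicks on the "next page" selector and retrieve the HTML that comes back (the table will still be horrible, and you'll need column targeting or some other HTML selector magic to get useful information out of it, thanks to said inept programmers). RSelenium introduces an external dependency that, as you'll see if you search on SO, many R users have trouble getting to work, especially on that equally horrible legacy operating system, Windows. If you can get Selenium running and RSelenium working with it, it might be easier in the long run if all of the above seems daunting (you'll still need to get into Developer Tools at some point, so the above is probably worth the effort, and you'll also need the HTML selectors for the various buttons to target with Selenium).

I'd seriously avoid PhantomJS, since it is now in a "best effort" maintenance state and you'd have to figure out how to do the above in JavaScript rather than in R.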

Answer 2

Getting Selenium to work might (in the long run) be easier than trying to figure out the nuances needed to obtain and maintain a session:

library(wdman) # for managing the Selenium server d/l
library(RSelenium) # for getting a connection to the Selenium server
library(seleniumPipes) # for better navigation & scraping idioms
library(rvest) # for html_node(), html_nodes(), html_text(), html_attr() used below

This should install the jar and start the server:

selServ <- selenium()

We need the port number, so run this and look for the port in the messages:

selServ$log()$stderr

Now we need to connect to it, using the port number from the log output above. In my case it was 4567:

sel <- remoteDr(browserName = "chrome", port = 4567)

Now, go to the main URL:

sel %>% 
  go("https://jurispub.admin.ch/publiws/pub/search.jsf")

Click the initial submit button to start the scraping process:

sel %>% 
  findElement("name", "form:searchSubmitButton") %>%  # find the submit button 
  elementClick() # click it

We're now on the next page, so grab the columns as in the other answer's example:

sel %>% 
  getPageSource() %>% # like read_html()
  html_node("table.iceDatTbl") -> dtbl  # this is the data table

html_nodes(dtbl, xpath=".//td[@class='iceDatTblCol1']/a") %>% # get doc ids
  html_text()

html_nodes(dtbl, xpath=".//td[@class='iceDatTblCol2']/a[contains(@href, 'publiws')]") %>% 
  html_attr("href") # get pdf links

And so on for the other columns, as in the other answer.

Now get the pagination info, as in the other answer:

sel %>% 
  getPageSource() %>% 
  html_node("span.iceOutFrmt") %>% 
  html_text() # the total items / pagination info

sel %>%
  findElement("xpath", ".//img[contains(@src, 'arrow-next')]/../../a") %>% 
  elementClick() # go to next page

Repeat the table scraping above. As the other answer suggests, you should put the whole thing in a for loop based on the total items / pagination info.
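
A minimal sketch of such a loop, built only from the seleniumPipes and rvest calls already used above. scrape_with_selenium is a hypothetical name, the fixed Sys.sleep() is a crude stand-in for a proper wait, and the code assumes both columns return the same number of links on every page:

```r
# Hypothetical wrapper: scrape n_pages result pages, starting from the page
# the browser session `sel` is currently showing.
scrape_with_selenium <- function(sel, n_pages, delay = 1) {
  out <- vector("list", n_pages)
  for (i in seq_len(n_pages)) {
    src <- seleniumPipes::getPageSource(sel)
    dtbl <- rvest::html_node(src, "table.iceDatTbl")
    out[[i]] <- data.frame(
      dossier = rvest::html_text(
        rvest::html_nodes(dtbl, xpath = ".//td[@class='iceDatTblCol1']/a")
      ),
      pdf_href = rvest::html_attr(
        rvest::html_nodes(
          dtbl,
          xpath = ".//td[@class='iceDatTblCol2']/a[contains(@href, 'publiws')]"
        ),
        "href"
      ),
      stringsAsFactors = FALSE
    )
    if (i < n_pages) {
      # same "next page" click as above
      nxt <- seleniumPipes::findElement(
        sel, "xpath", ".//img[contains(@src, 'arrow-next')]/../../a"
      )
      seleniumPipes::elementClick(nxt)
      Sys.sleep(delay) # give the table time to refresh before re-reading it
    }
  }
  do.call(rbind, out)
}
```

Calling scrape_with_selenium(sel, n_pages = 3) right after the initial submit click would then collect the first three result pages into one data frame.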

When you're done, don't forget to call:

selServ$stop()
