RSelenium; Looping and downloading csv files

Question

I am trying to use RSelenium (with Docker) to extract data from this website: https://nominatransparente.rhnet.gob.mx

#-- Load package
library(RSelenium)
library(rvest)
library(xml2)
library(tidyverse)

#-- Remote driver
remDr <- remoteDriver(remoteServerAddr = "192.168.99.100", port = 4445L, browserName = "chrome")
remDr$open()

#-- navigate to the website 
remDr$navigate("https://nominatransparente.rhnet.gob.mx/")

#-- confirm the website
remDr$getTitle()

#-- screenshot 
remDr$screenshot(display = TRUE)

#-- Loading website's extra information
Sys.sleep(15)

#-- selecting filters: manipulate 
webElement <- remDr$findElement("class name", "switch")
webElement$clickElement()

webElement <- remDr$findElement("class name", "ng-input")
webElement$clickElement()

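Up to this point I can select and click the dropdown menu, but I cannot select each item inside the dropdown (I could not find the right XPath or ID). I would like to go through each of these items, and also through those of the second dropdown, and then download their respective CSV files.

I would like to do everything with RSelenium. I have seen similar questions, but they use rvest. Is there an efficient way to extract all the CSV files?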

Answer 1

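My Spanish is a bit rusty, but if I am not mistaken, you are trying to first toggle los filtros de búsqueda por Sector e Institución and then go through the sector x institución combinations.

If you click on one of the combinations, say Aportaciones de Seguridad Social x Fondo de la Vivienda del ISSSTE, you can observe the following network request: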

method GET
url "https://dgti-ejz-mspadronserpub.200.34.175.120.nip.io/ms/InfoPadron/servidoresPublicosSector/19/HC6/1/100?query=nombres,primerApellido,segundoApellido,dependencia,tipoEntidad,nombrePuesto,sueldoBase,compensacionGarantizada"
Headers:
Host: dgti-ejz-mspadronserpub.200.34.175.120.nip.io
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:71.0) Gecko/20100101 Firefox/71.0
Accept: application/json
Accept-Language: de,en-US;q=0.7,en;q=0.3
Accept-Encoding: gzip, deflate, br
Referer: https://nominatransparente.rhnet.gob.mx/
Origin: https://nominatransparente.rhnet.gob.mx
Connection: keep-alive
TE: Trailers

The response is JSON containing the relevant data. We can make exactly the same request from within R using httr:

# Make the request
headers <- c(
    "Host" = "dgti-ejz-mspadronserpub.200.34.175.120.nip.io",
    "User-Agent" = "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:71.0) Gecko/20100101 Firefox/71.0",
    "Accept" = "application/json",
    "Referer" = "https://nominatransparente.rhnet.gob.mx",
    "Origin" = "https://nominatransparente.rhnet.gob.mx",
    "Connection" = "keep-alive",
    "TE" = "Trailers"
)
url <- "https://dgti-ejz-mspadronserpub.200.34.175.120.nip.io/ms/InfoPadron/servidoresPublicosSector/19/HC6/1/100?query=nombres,primerApellido,segundoApellido,dependencia,tipoEntidad,nombrePuesto,sueldoBase,compensacionGarantizada"

response <- httr::GET(url, httr::add_headers(headers))
# Extract the data
data <- httr::content(response)
# Example, the first entry
data$listDtoServidorPublico[[1]]
# $nombres
# [1] "JOSE OSCAR"
# 
# $primerApellido
# [1] "ABURTO"
# 
# $segundoApellido
# [1] "LOPEZ"
# 
# $dependencia
# [1] "FONDO DE LA VIVIENDA DEL ISSSTE"
# 
# $tipoEntidad
# [1] "ORGANISMO DESCENTRALIZADO"
# 
# $nombrePuesto
# [1] "JEFE DE AREA PROF B EN PROC HIPOTEC FOVISSSTE"
# 
# $sueldoBase
# [1] 9432
# 
# $compensacionGarantizada
# [1] 2096
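
Since the goal is a CSV file, the parsed list can be flattened into a data frame and written to disk. This is a minimal sketch, assuming every record holds the same scalar fields as the entry printed above. `records_to_df` and the output file name are hypothetical, and the sample record only reuses values shown above so that the snippet runs without a network connection:

```r
# Hypothetical helper: turn a list of records (each a named list of scalars,
# shaped like data$listDtoServidorPublico) into a single data frame.
records_to_df <- function(records) {
  rows <- lapply(records, function(rec) {
    # JSON nulls arrive as NULL; replace them with NA so each row keeps all columns
    rec[vapply(rec, is.null, logical(1))] <- NA
    as.data.frame(rec, stringsAsFactors = FALSE)
  })
  do.call(rbind, rows)
}

# Stand-in for the parsed response (values copied from the output above)
sample_records <- list(
  list(nombres = "JOSE OSCAR", primerApellido = "ABURTO",
       segundoApellido = "LOPEZ", sueldoBase = 9432,
       compensacionGarantizada = 2096)
)

df <- records_to_df(sample_records)

# Write the table as CSV (assumed file name, stored in the session temp dir)
csv_path <- file.path(tempdir(), "fovissste.csv")
write.csv(df, csv_path, row.names = FALSE)
```

With the real response, the call would be `records_to_df(data$listDtoServidorPublico)`.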

As you can see, this is much simpler than bringing in the heavy artillery of Selenium + Docker.

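In addition, you can iterate over the sector x institución combinations. The key is probably to change the URL parameters to receive different combinations (i.e. the ?query=... part of the url). I have not investigated this myself, but you should be able to figure it out by inspecting the DOM and the network when requesting other combinations.

Edit 1: Inspecting the network

In your browser, open the developer tools and click on the Network tab. When you execute Buscar, a new request should appear, similar to the one above (depending on the combination selected).

I did this for another combination and observed that the request URL is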

https://dgti-ejz-mspadronserpub.200.34.175.120.nip.io/ms/InfoPadron/servidoresPublicosSector/25/C00/1/100?query=nombres,primerApellido,segundoApellido,dependencia,tipoEntidad,nombrePuesto,sueldoBase,compensacionGarantizada

So I was wrong about which part of the URL you have to adjust. If you compare the two links, you will see that they differ as follows:

 url_1 = x + 19/HC6 + y
 url_2 = x + 25/C00 + y
 # where
 x = https://dgti-ejz-mspadronserpub.200.34.175.120.nip.io/ms/InfoPadron/servidoresPublicosSector/
 y = /1/100?query=nombres,primerApellido,segundoApellido,dependencia,tipoEntidad,nombrePuesto,sueldoBase,compensacionGarantizada

So it looks like each sector x institución is encoded as VW/XYZ. If you retrieve all of these codes, you can iterate over the combinations.
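
The comparison above suggests that a request URL can be assembled from the two codes. A minimal sketch: `build_url` is a hypothetical helper, and the trailing `1/100` segment is copied unchanged from the two observed requests, since its meaning is not established here.

```r
# Fixed prefix shared by both observed request URLs
base_url <- "https://dgti-ejz-mspadronserpub.200.34.175.120.nip.io/ms/InfoPadron/servidoresPublicosSector/"

# Fields requested in the ?query=... part of the observed URLs
fields <- c("nombres", "primerApellido", "segundoApellido", "dependencia",
            "tipoEntidad", "nombrePuesto", "sueldoBase",
            "compensacionGarantizada")

# Hypothetical helper: build the URL for one sector/institution code pair.
# The "/1/100" segment is kept exactly as seen in the browser.
build_url <- function(sector, institution) {
  paste0(base_url, sector, "/", institution, "/1/100?query=",
         paste(fields, collapse = ","))
}

url_1 <- build_url("19", "HC6")
url_2 <- build_url("25", "C00")
```

Both results should match the URLs observed in the browser character for character, which is a quick check that the split into prefix, codes, and suffix is right.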

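Finally, if you inspect the network further, you may find some requests that contain the mapping for these encodings.

Edit 2

As suspected, while inspecting the network I came across a request labelled sectores.json with the following request URL: https://nominatransparente.rhnet.gob.mx/assets/sectores.json. This contains at least the mapping for the sector part I was referring to. Looking further will probably yield something similar for institución.

You probably have to toggle and click on a given sector before you can see all the institución options for that sector. Then you will see a similar mapping in the DOM. I would suggest: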

1. Get the sector mapping
2. Find out inside the network how the list of instituciones is given back. Probably something like:
-> Request containing sector-ID in the URL -> return a JSON with all instituciones
3. Once you figure out the logic behind it, use httr::GET to create a list of all sector x institución
4. Once you have this list, iterate over all combinations to get JSON data as above.
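
The steps above could be sketched as a loop along the following lines. This is only an outline under assumptions: the `combos` data frame and its column names, `fetch_records`, and `download_all` are all hypothetical, and the real sector/institution codes have to come from the mappings found in steps 1 to 3. The fetcher is passed in as an argument so that the loop can be tried with a stand-in before touching the network.

```r
# Hypothetical fetcher for one sector x institución pair (requires httr).
# Assumes the records sit under listDtoServidorPublico, as in the response above.
fetch_records <- function(sector, institution) {
  url <- paste0(
    "https://dgti-ejz-mspadronserpub.200.34.175.120.nip.io/ms/InfoPadron/",
    "servidoresPublicosSector/", sector, "/", institution, "/1/100",
    "?query=nombres,primerApellido,segundoApellido,dependencia,",
    "tipoEntidad,nombrePuesto,sueldoBase,compensacionGarantizada"
  )
  response <- httr::GET(url, httr::add_headers(Accept = "application/json"))
  httr::stop_for_status(response)
  httr::content(response)$listDtoServidorPublico
}

# Hypothetical driver: one CSV per row of `combos` (columns sector, institution).
download_all <- function(combos, out_dir, fetch = fetch_records, pause = 1) {
  dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)
  paths <- rep(NA_character_, nrow(combos))
  for (i in seq_len(nrow(combos))) {
    sector <- combos$sector[i]
    institution <- combos$institution[i]
    records <- fetch(sector, institution)
    if (length(records) == 0) next  # nothing returned for this combination
    rows <- lapply(records, as.data.frame, stringsAsFactors = FALSE)
    paths[i] <- file.path(out_dir, paste0(sector, "_", institution, ".csv"))
    write.csv(do.call(rbind, rows), paths[i], row.names = FALSE)
    Sys.sleep(pause)  # pause between requests to go easy on the server
  }
  paths
}

# Offline trial run with a stand-in fetcher and the two codes seen above
fake_fetch <- function(sector, institution) {
  list(list(nombres = "JOSE OSCAR", sueldoBase = 9432))
}
combos <- data.frame(sector = c("19", "25"), institution = c("HC6", "C00"),
                     stringsAsFactors = FALSE)
out_dir <- file.path(tempdir(), "nomina_csv")
paths <- download_all(combos, out_dir, fetch = fake_fetch, pause = 0)
```

For a real run, drop the `fetch` argument so that `fetch_records` is used. Whether a single request returns every row for a combination has not been verified here.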

loops

r

web-scraping

rselenium

rvest