Selenium; downloading CSV files in a loop

Maximiliano Rodriguez

I am trying to extract data from this website using RSelenium (with Docker): https://nominatransparente.rhnet.gob.mx

#-- Load package
library(RSelenium)
library(rvest)
library(xml2)
library(tidyverse)

#-- Remote driver
remDr <- remoteDriver(remoteServerAddr = "192.168.99.100", port = 4445L, browserName = "chrome")
remDr$open()

#-- navigate to the website 
remDr$navigate("https://nominatransparente.rhnet.gob.mx/")

#-- confirm the website
remDr$getTitle()

#-- screenshot 
remDr$screenshot(display = TRUE)

#-- Loading website's extra information
Sys.sleep(15)

#-- selecting filters: manipulate 
webElement <- remDr$findElement("class name", "switch")
webElement$clickElement()

webElement <- remDr$findElement("class name", "ng-input")
webElement$clickElement()

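So far I can only select and click the dropdown menu, but I cannot select each item from it (I cannot find the right xpath or ID). I want to go through all of these items, as well as the second dropdown menu, and download their respective CSV files.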

I would like to do everything with RSelenium. I have seen a similar question here, but it used rvest. Is there an efficient way to extract all the CSV files?

nobody

My Spanish is a little rusty, but if I understand correctly, you are trying to first toggle los filtros de búsqueda por Sector e Institución and then go through the sector x institución combinations.

If you click one of the combinations, e.g. Aportaciones de Seguridad Social x Fondo de la Vivienda del ISSSTE, you can observe the following network request:

method GET
url "https://dgti-ejz-mspadronserpub.200.34.175.120.nip.io/ms/InfoPadron/servidoresPublicosSector/19/HC6/1/100?query=nombres,primerApellido,segundoApellido,dependencia,tipoEntidad,nombrePuesto,sueldoBase,compensacionGarantizada"
Headers:
Host: dgti-ejz-mspadronserpub.200.34.175.120.nip.io
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:71.0) Gecko/20100101 Firefox/71.0
Accept: application/json
Accept-Language: de,en-US;q=0.7,en;q=0.3
Accept-Encoding: gzip, deflate, br
Referer: https://nominatransparente.rhnet.gob.mx/
Origin: https://nominatransparente.rhnet.gob.mx
Connection: keep-alive
TE: Trailers

The response is a JSON document containing the relevant data, and we can make the very same request from within R using httr:

# Make the request
headers <- c(
    "Host" = "dgti-ejz-mspadronserpub.200.34.175.120.nip.io",
    "User-Agent" = "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:71.0) Gecko/20100101 Firefox/71.0",
    "Accept" = "application/json",
    "Referer" = "https://nominatransparente.rhnet.gob.mx",
    "Origin" = "https://nominatransparente.rhnet.gob.mx",
    "Connection" = "keep-alive",
    "TE" = "Trailers"
)
url <- "https://dgti-ejz-mspadronserpub.200.34.175.120.nip.io/ms/InfoPadron/servidoresPublicosSector/19/HC6/1/100?query=nombres,primerApellido,segundoApellido,dependencia,tipoEntidad,nombrePuesto,sueldoBase,compensacionGarantizada"

response <- httr::GET(url, httr::add_headers(headers))
# Extract the data
data <- httr::content(response)
# Example, the first entry
data$listDtoServidorPublico[[1]]
# $nombres
# [1] "JOSE OSCAR"
# 
# $primerApellido
# [1] "ABURTO"
# 
# $segundoApellido
# [1] "LOPEZ"
# 
# $dependencia
# [1] "FONDO DE LA VIVIENDA DEL ISSSTE"
# 
# $tipoEntidad
# [1] "ORGANISMO DESCENTRALIZADO"
# 
# $nombrePuesto
# [1] "JEFE DE AREA PROF B EN PROC HIPOTEC FOVISSSTE"
# 
# $sueldoBase
# [1] 9432
# 
# $compensacionGarantizada
# [1] 2096

As you can see, this is much simpler than bringing out the heavy artillery of Selenium + Docker.
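
Since the goal is CSV files, the parsed list can be flattened into a data frame and written to disk. This is only a minimal sketch: records_to_df is a hypothetical helper name, the output file name is made up, and it assumes every record carries the same fields as the first entry shown above.

```r
# Hypothetical helper: turn the parsed JSON records into a data frame.
# Assumes all records share the same field names.
records_to_df <- function(records) {
  rows <- lapply(records, function(rec) {
    # JSON nulls arrive as NULL; replace them with NA so each row keeps all columns.
    rec[vapply(rec, is.null, logical(1))] <- NA
    as.data.frame(rec, stringsAsFactors = FALSE)
  })
  do.call(rbind, rows)
}

# The first entry printed above, standing in for data$listDtoServidorPublico.
records <- list(list(
  nombres = "JOSE OSCAR",
  primerApellido = "ABURTO",
  segundoApellido = "LOPEZ",
  dependencia = "FONDO DE LA VIVIENDA DEL ISSSTE",
  tipoEntidad = "ORGANISMO DESCENTRALIZADO",
  nombrePuesto = "JEFE DE AREA PROF B EN PROC HIPOTEC FOVISSSTE",
  sueldoBase = 9432,
  compensacionGarantizada = 2096
))

df <- records_to_df(records)

# Illustrative file name; here it goes into the session's temporary directory.
csv_path <- file.path(tempdir(), "19_HC6.csv")
write.csv(df, csv_path, row.names = FALSE)
```

With a real response you would call records_to_df(data$listDtoServidorPublico) instead of using the hand-written list.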

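Moreover, you can iterate over the sector x institución combinations. The key is probably to change the URL parameters to receive different combinations (i.e. the ?query=... part of the URL). I have not investigated this myself, but by inspecting the DOM and the network while requesting other combinations, you should be able to figure it out.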

Edit 1: Inspecting the network

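In your browser, open the developer tools and click on the Network tab. When you perform a Buscar, a new request appears that is similar to the one above (depending on the selected combination).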

I tried another combination and observed that the requested URL was

https://dgti-ejz-mspadronserpub.200.34.175.120.nip.io/ms/InfoPadron/servidoresPublicosSector/25/C00/1/100?query=nombres,primerApellido,segundoApellido,dependencia,tipoEntidad,nombrePuesto,sueldoBase,compensacionGarantizada

So I was wrong about which part of the URL has to be adjusted: if you compare the two links, this is where they differ

 url_1 = x + 19/HC6 + y
 url_2 = x + 25/C00 + y
 # where
 x = https://dgti-ejz-mspadronserpub.200.34.175.120.nip.io/ms/InfoPadron/servidoresPublicosSector/
 y = /1/100?query=nombres,primerApellido,segundoApellido,dependencia,tipoEntidad,nombrePuesto,sueldoBase,compensacionGarantizada

So it looks like each sector x institución is encoded as VW/XYZ. If you retrieve all of these codes, you can iterate over the combinations.
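
To make the pattern concrete, here is a small sketch that rebuilds the two observed URLs from their codes. build_url is a hypothetical helper, and the trailing /1/100 is simply copied from the observed requests; I have not checked what those two segments mean.

```r
# Fixed parts of the request URL, copied from the two observed requests.
base_url <- "https://dgti-ejz-mspadronserpub.200.34.175.120.nip.io/ms/InfoPadron/servidoresPublicosSector"
fields <- c("nombres", "primerApellido", "segundoApellido", "dependencia",
            "tipoEntidad", "nombrePuesto", "sueldoBase", "compensacionGarantizada")

# Hypothetical helper: plug a sector code and an institución code into the URL.
# The "/1/100" segment is kept verbatim; its meaning is an open question.
build_url <- function(sector, institucion) {
  sprintf("%s/%s/%s/1/100?query=%s",
          base_url, sector, institucion, paste(fields, collapse = ","))
}

# The only two combinations observed so far.
combos <- data.frame(
  sector = c("19", "25"),
  institucion = c("HC6", "C00"),
  stringsAsFactors = FALSE
)

# sprintf() is vectorised, so this yields one URL per row.
urls <- build_url(combos$sector, combos$institucion)
```

Once the full list of codes is known, only the combos data frame has to grow.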

Finally, if you inspect the network activity further, you will find some requests that contain these code mappings.

Edit 2

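As suspected, while inspecting the network I came across a request labelled sectores.json with the request URL https://nominatransparente.rhnet.gob.mx/assets/sectores.json. It contains the mapping for at least the sector part I was referring to. Looking further will probably turn up something similar for institución.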

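You probably have to toggle and click a given sector and then look at all the institución options for that sector. Inside the DOM you will then see a similar mapping. I would suggest: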

1. Get the sector mapping
2. Find out inside the network how the list of instituciones is given back. Probably something like:
-> Request containing sector-ID in the URL -> return a JSON with all instituciones
3. Once you figure out the logic behind it, use httr::GET to create a list of all sector x institución
4. Once you have this list, iterate over all combinations to get JSON data as above.
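
Step 4 could look roughly like the sketch below. It is not a finished solution: fetch_records and save_all are hypothetical names, the combos table only holds the two pairs observed above (steps 1 to 3 still have to supply the rest), and the code assumes every record in a response has the same fields.

```r
base_url <- "https://dgti-ejz-mspadronserpub.200.34.175.120.nip.io/ms/InfoPadron/servidoresPublicosSector"
query <- "nombres,primerApellido,segundoApellido,dependencia,tipoEntidad,nombrePuesto,sueldoBase,compensacionGarantizada"

# Hypothetical helper: request one sector x institución combination.
# The "/1/100" segment is copied verbatim from the observed requests.
fetch_records <- function(sector, institucion) {
  url <- sprintf("%s/%s/%s/1/100?query=%s", base_url, sector, institucion, query)
  response <- httr::GET(url, httr::add_headers(
    Accept = "application/json",
    Referer = "https://nominatransparente.rhnet.gob.mx/"
  ))
  httr::stop_for_status(response)
  httr::content(response)$listDtoServidorPublico
}

# Hypothetical helper: loop over all combinations and write one CSV per pair.
# `fetch` can be swapped out, e.g. for a stub while testing offline.
save_all <- function(combos, out_dir, fetch = fetch_records, pause = 1) {
  dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)
  files <- character(0)
  for (i in seq_len(nrow(combos))) {
    records <- fetch(combos$sector[i], combos$institucion[i])
    rows <- lapply(records, function(rec) {
      rec[vapply(rec, is.null, logical(1))] <- NA  # JSON null -> NA
      as.data.frame(rec, stringsAsFactors = FALSE)
    })
    if (length(rows) > 0) {
      path <- file.path(out_dir,
                        sprintf("%s_%s.csv", combos$sector[i], combos$institucion[i]))
      write.csv(do.call(rbind, rows), path, row.names = FALSE)
      files <- c(files, path)
    }
    Sys.sleep(pause)  # small pause between requests
  }
  files
}

# Only the two pairs seen so far; extend once the mappings are known.
combos <- data.frame(sector = c("19", "25"), institucion = c("HC6", "C00"),
                     stringsAsFactors = FALSE)
# files <- save_all(combos, "csv")  # performs the real requests
```

Passing the fetch function as an argument keeps the loop testable without hitting the server.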

Edited on 2021-04-1