R - 带有输入的网络抓取动态表单

R - web scraping dynamic form with inputs

我正试图在 R 中完成以下任务。 本网站提供有关印度农业数据的分区级统计数据(在 table 秒内):http://agcensus.dacnet.nic.in/tehsilsummarytype.aspx 我的理解是,这称为动态表单,因为选项会根据所做的条目而变化。具体来说,我想下载 tables 用于:

  1. 邦=安得拉邦
  2. 地区 = 阿迪拉巴德、阿南塔普尔、卡达帕、...(共 8 个)
  3. Tahsil = Mancherial, Kasipet(这些只是 District = Adilabad 的两个例子;总共 158 个)

然后我想要 "Average Holdings by size" 所有社会群体,所有性别和总数。

基于此 post What if I want to web scrape with R for a page with parameters? 我认为要走的路是使用 getHTMLFormDescription()。但是,由于我的表单是动态的,所以我无法遵循另一个 post 中建议的路线。 createFunction() 行 returns 一个错误: “* writeFunction(formDescription, character(), url, con, verbose = verbose, 错误:您应该在此处提供表单描述。请参阅 getFormDescription()。”

在omegahat网页可以下载的RHTMLForms包中, 有这个功能(顾名思义)应该做我需要的:

function function(desc, omit = character(), drop = TRUE, ..., verbose = FALSE) {
# Discard the elements that we are omitting.
 if(length(omit)) {
   idx = match(omit, names(desc$elements), 0)
  k = class(desc$elements)
 desc$elements <- desc$elements[-idx]
class(desc$elements) = k }

 # If no more elements left as a result of omitting them, just return the description
 # as there are definitely no more dynamic components left.
if(length(desc$elements) == 0)
 return(desc)

   # Now find the dynamic components.
 dyn = sapply(desc$elements, inherits, "DynamicHTMLFormElement")
if(!any(dyn)) 
return(desc)

pivot = desc$elements[[min(which(dyn))]]

 # We will need to submit the form for each value of this dynamic element, so
  # get the URI.  If the URI changes depending on the value, we are out of     luck!!
url = mergeURI(URI(desc$formAttributes["action"]), URI(desc$url)) 

 # Prepare the return value with the pivot information and we will build up
 # the branches by looping over the possible values.
 descriptions = list(elementName = pivot$name,
                 description = pivot,
                 values = list())   

 omit = c(omit, pivot$name)

 for(i in names(pivot$options)) {
   # Create the arguments for the submission. We may need to include them all.
 args = list(i)
 names(args)[1] = pivot$name

 if(verbose) 
   cat("Checking ", pivot$name, " - option", i, "\n")

  #XX we may need to provide all the arguments rather than just this one.
  # or perhaps cumulate them for the elements we have already deal with.
  # We have the defaults and the possible values from the original description.
 page = formQuery(args, toString(url), desc, .checkArgs = FALSE, ...)
  # Make certain that we turn the checkDynamic off here to avoid recursively.
 tmp = getHTMLFormDescription(page, asText = TRUE, handlers = multiFormElementHandlers(url, checkDynamic = FALSE))
 tmp = getDynamicHTMLFormDescription(tmp, omit = omit)

   # Now remove the elements that we are omitting. This leaves a subset of the form.
 if(drop) {
    idx = match(omit, names(tmp$elements), 0)

    if(any(is.na(idx))) {
      k = class(tmp$elements)
      tmp$elements = tmp$elements[is.na(idx)]
      class(tmp$elements) = k
    }

    class(tmp) <- c("HTMLFormSubset", class(tmp))        
 }

 descriptions$values[[i]] = tmp
 }

class(descriptions) <- c("DynamicFormElementPath")

descriptions
}

但是,我也无法让它工作 - 调用 getDynamicHTMLFormDescription("http://agcensus.dacnet.nic.in/tehsilsummarytype.aspx") 给出 "Error in desc$elements : $ operator is invalid for atomic vectors".

有人对如何解决这个问题有什么建议吗?一旦我有办法填写表格并访问每个分区 (tahsil) 的 table,我就知道如何整理数据。这实际上只是让 R 填写这个(特定的)表格。

欢迎任何帮助! 迈克尔·凯撒 (加州大学圣地亚哥分校助理研究员)

这是使用 RSelenium

下载数据的解决方案
  1. 邦=安得拉邦
  2. 地区 = 阿迪拉巴特
  3. 泰西尔 = Mancherial
  4. 表格 = 按规模组划分的运营控股的平均规模

其余字段使用默认输入参数。

library(RSelenium)
library(XML)
library(magrittr)

# Start Selenium Server --------------------------------------------------------

checkForServer()
startServer()
remDrv <- remoteDriver()
remDrv$open()


# Simulate browser session and fill out form -----------------------------------

remDrv$navigate('http://agcensus.dacnet.nic.in/tehsilsummarytype.aspx')
remDrv$findElement(using = "xpath", 
                   "//select[@name = '_ctl0:ContentPlaceHolder1:DropDownList1']/option[@value = '1a']")$clickElement()
remDrv$findElement(using = "xpath",
                   "//select[@name = '_ctl0:ContentPlaceHolder1:DropDownList9']/option[@value = '19']")$clickElement()
remDrv$findElement(using = "xpath",
                   "//option[@value = '33']")$clickElement()
remDrv$findElement(using = "xpath",
                   "//select[@name = '_ctl0:ContentPlaceHolder1:DropDownList3']/option[@value = '4']")$clickElement()
# Click submit
remDrv$findElement(using = "xpath",
                   "//input[@value = 'Submit']")$clickElement()


# Retrieve and download results ------------------------------------------------

table <- remDrv$getPageSource()[[1]] %>% 
  htmlParse %>% 
  readHTMLTable %>% 
  extract2(4)

remDrv$quit()
remDrv$closeServer()

head(table)

#      V1                      V2                     V3
# 1 SI No. Size of Holding(in ha.) Institutional Holdings
# 2    (1)                     (2)                    (3)
# 3      1                MARGINAL                      0
# 4      2                   SMALL                      0
# 5      3              SEMIMEDIUM                      0
# 6      4                  MEDIUM                      0

但是,上面的静态解决方案只回答了您的部分问题,即如何使用 R 填写网络表单。

您网页上的棘手之处在于不同下拉菜单中的值相互依赖。

下面,您将找到一个解决方案,它考虑了这些依赖关系,而不需要您预先知道各自的地区和 tehsils ID。

下面的代码为

下载数据
  1. 州 = GOA
  2. 表格 = 按规模组划分的运营控股的平均规模

包括所有地区和所有 tehsils。我使用 GOA 作为主要锚点,但您可以轻松地 select 选择其他状态。

library(RSelenium)
library(XML)
library(dplyr)
library(magrittr)

# Start Selenium Server --------------------------------------------------------

checkForServer()
startServer()
remDrv <- remoteDriver()
remDrv$open()


# Simulate browser session and fill out form -----------------------------------

remDrv$navigate('http://agcensus.dacnet.nic.in/tehsilsummarytype.aspx')

# Select 27a == GOA as the anchor
remDrv$findElement(using = "xpath", 
                   "//select[@name = '_ctl0:ContentPlaceHolder1:DropDownList1']/option[@value = '27a']")$clickElement()
# Select 4 == Average Size of Operational Holding by Size Group 
remDrv$findElement(using = "xpath",
                   "//select[@name = '_ctl0:ContentPlaceHolder1:DropDownList3']/option[@value = '4']")$clickElement()

# Get all district IDs and the respective names belonging to GOA
district_IDs <- remDrv$findElements(using = "xpath",
                                "//select[@name = '_ctl0:ContentPlaceHolder1:DropDownList9']/option") %>%
  lapply(function(x){x$getElementAttribute('value')}) %>% 
  unlist

district_names <- remDrv$findElements(using = "xpath",
                                    "//select[@name = '_ctl0:ContentPlaceHolder1:DropDownList9']/option") %>%
  lapply(function(x){x$getElementText()}) %>% 
  unlist


# Retrieve and download results ------------------------------------------------

result <- data.frame(district = character(), teshil = character(), 
                     V1 = character(), V2 = character(), V3 = character())

for (i in seq_along(district_IDs)) {

  remDrv$findElement(using = "xpath",
                     paste0("//select[@name = '_ctl0:ContentPlaceHolder1:DropDownList9']/option[@value = ", 
                          "'", district_IDs[i], "']"))$clickElement()
  Sys.sleep(2)

  # Get all tehsil IDs and names from the currently selected district
  tehsil_IDs <- remDrv$findElements(using = "xpath",
                     "//div[@id = '_ctl0_ContentPlaceHolder1_Panel4']/select/option") %>%
    lapply(function(x){x$getElementAttribute('value')}) %>%
    unlist

  tehsil_names <- remDrv$findElements(using = "xpath",
                                                    "//div[@id = '_ctl0_ContentPlaceHolder1_Panel4']/select/option") %>%
    lapply(function(x){x$getElementText()}) %>%
    unlist

  for (j in seq_along(tehsil_IDs)) {

    remDrv$findElement(using = "xpath",
                        paste0("//div[@id = '_ctl0_ContentPlaceHolder1_Panel4']/select/option[@value = ",
                               "'", tehsil_IDs[j], "']"))$clickElement()
    Sys.sleep(2)

    # Click submit and download data of the selected tehsil
    remDrv$findElement(using = "xpath",
                       "//input[@value = 'Submit']")$clickElement()
    Sys.sleep(2)

    # Download data for current tehsil
    tehsil_data <- remDrv$getPageSource()[[1]] %>% 
      htmlParse %>% 
      readHTMLTable %>% 
      extract2(4) %>% 
      extract(c(-1, -2), )

    result <- data.frame(district = district_names[i], tehsil = tehsil_names[j],
                         tehsil_data) %>% rbind(result, .)

    remDrv$goBack()
    Sys.sleep(2)
  }   
}

remDrv$quit()
remDrv$closeServer()

result %<>% as_data_frame %>%
  rename(
    si_no = V1,
    holding_size = V2, 
    inst_holdings = V3
    ) %>% 
  mutate(
    si_no = as.numeric(as.character(si_no)),
    inst_holdings = as.numeric(as.character(inst_holdings))
    )

dim(result)
# [1] 66  5

head(result)
# district tehsil si_no   holding_size inst_holdings
# 1 NORTH GOA  ponda     1       MARGINAL          0.34
# 2 NORTH GOA  ponda     2          SMALL          0.00
# 3 NORTH GOA  ponda     3     SEMIMEDIUM          2.50
# 4 NORTH GOA  ponda     4         MEDIUM          0.00
# 5 NORTH GOA  ponda     5          LARGE        182.64
# 6 NORTH GOA  ponda     6 ALL SIZE CLASS         41.09

tail(result)
# district tehsil si_no   holding_size inst_holdings
# 1 SOUTH GOA quepem     1       MARGINAL          0.30
# 2 SOUTH GOA quepem     2          SMALL          0.00
# 3 SOUTH GOA quepem     3     SEMIMEDIUM          0.00
# 4 SOUTH GOA quepem     4         MEDIUM          0.00
# 5 SOUTH GOA quepem     5          LARGE         23.50
# 6 SOUTH GOA quepem     6 ALL SIZE CLASS         15.77

RSelenium 甚至支持利用 PhantomJS 的无头浏览,如 vignette.

中所述