R - 带有输入的网络抓取动态表单
R - web scraping dynamic form with inputs
我正试图在 R 中完成以下任务。
本网站提供有关印度农业数据的分区级统计数据(在 table 秒内):http://agcensus.dacnet.nic.in/tehsilsummarytype.aspx
我的理解是,这称为动态表单,因为选项会根据所做的条目而变化。具体来说,我想下载 tables 用于:
- 邦=安得拉邦
- 地区 = 阿迪拉巴德、阿南塔普尔、卡达帕、...(共 8 个)
- Tahsil = Mancherial, Kasipet(这些只是 District = Adilabad 的两个例子;总共 158 个)
然后我想要 "Average Holdings by size" 所有社会群体,所有性别和总数。
基于此 post What if I want to web scrape with R for a page with parameters? 我认为要走的路是使用 getHTMLFormDescription()。但是,由于我的表单是动态的,所以我无法遵循另一个 post 中建议的路线。 createFunction() 行 returns 一个错误:
“* writeFunction(formDescription, character(), url, con, verbose = verbose, 错误:您应该在此处提供表单描述。请参阅 getFormDescription()。”
在omegahat网页可以下载的RHTMLForms包中,
有这个功能(顾名思义)应该做我需要的:
function function(desc, omit = character(), drop = TRUE, ..., verbose = FALSE) {
# Discard the elements that we are omitting.
if(length(omit)) {
idx = match(omit, names(desc$elements), 0)
k = class(desc$elements)
desc$elements <- desc$elements[-idx]
class(desc$elements) = k }
# If no more elements left as a result of omitting them, just return the description
# as there are definitely no more dynamic components left.
if(length(desc$elements) == 0)
return(desc)
# Now find the dynamic components.
dyn = sapply(desc$elements, inherits, "DynamicHTMLFormElement")
if(!any(dyn))
return(desc)
pivot = desc$elements[[min(which(dyn))]]
# We will need to submit the form for each value of this dynamic element, so
# get the URI. If the URI changes depending on the value, we are out of luck!!
url = mergeURI(URI(desc$formAttributes["action"]), URI(desc$url))
# Prepare the return value with the pivot information and we will build up
# the branches by looping over the possible values.
descriptions = list(elementName = pivot$name,
description = pivot,
values = list())
omit = c(omit, pivot$name)
for(i in names(pivot$options)) {
# Create the arguments for the submission. We may need to include them all.
args = list(i)
names(args)[1] = pivot$name
if(verbose)
cat("Checking ", pivot$name, " - option", i, "\n")
#XX we may need to provide all the arguments rather than just this one.
# or perhaps cumulate them for the elements we have already deal with.
# We have the defaults and the possible values from the original description.
page = formQuery(args, toString(url), desc, .checkArgs = FALSE, ...)
# Make certain that we turn the checkDynamic off here to avoid recursively.
tmp = getHTMLFormDescription(page, asText = TRUE, handlers = multiFormElementHandlers(url, checkDynamic = FALSE))
tmp = getDynamicHTMLFormDescription(tmp, omit = omit)
# Now remove the elements that we are omitting. This leaves a subset of the form.
if(drop) {
idx = match(omit, names(tmp$elements), 0)
if(any(is.na(idx))) {
k = class(tmp$elements)
tmp$elements = tmp$elements[is.na(idx)]
class(tmp$elements) = k
}
class(tmp) <- c("HTMLFormSubset", class(tmp))
}
descriptions$values[[i]] = tmp
}
class(descriptions) <- c("DynamicFormElementPath")
descriptions
}
但是,我也无法让它工作 - 调用 getDynamicHTMLFormDescription("http://agcensus.dacnet.nic.in/tehsilsummarytype.aspx") 给出 "Error in desc$elements : $ operator is invalid for atomic vectors".
有人对如何解决这个问题有什么建议吗?一旦我有办法填写表格并访问每个分区 (tahsil) 的 table,我就知道如何整理数据。这实际上只是让 R 填写这个(特定的)表格。
欢迎任何帮助!
迈克尔·凯撒
(加州大学圣地亚哥分校助理研究员)
这是使用 RSelenium
为
下载数据的解决方案
- 邦=安得拉邦
- 地区 = 阿迪拉巴特
- 泰西尔 = Mancherial
- 表格 = 按规模组划分的运营控股的平均规模
其余字段使用默认输入参数。
library(RSelenium)
library(XML)
library(magrittr)
# Start Selenium Server --------------------------------------------------------
checkForServer()
startServer()
remDrv <- remoteDriver()
remDrv$open()
# Simulate browser session and fill out form -----------------------------------
remDrv$navigate('http://agcensus.dacnet.nic.in/tehsilsummarytype.aspx')
remDrv$findElement(using = "xpath",
"//select[@name = '_ctl0:ContentPlaceHolder1:DropDownList1']/option[@value = '1a']")$clickElement()
remDrv$findElement(using = "xpath",
"//select[@name = '_ctl0:ContentPlaceHolder1:DropDownList9']/option[@value = '19']")$clickElement()
remDrv$findElement(using = "xpath",
"//option[@value = '33']")$clickElement()
remDrv$findElement(using = "xpath",
"//select[@name = '_ctl0:ContentPlaceHolder1:DropDownList3']/option[@value = '4']")$clickElement()
# Click submit
remDrv$findElement(using = "xpath",
"//input[@value = 'Submit']")$clickElement()
# Retrieve and download results ------------------------------------------------
table <- remDrv$getPageSource()[[1]] %>%
htmlParse %>%
readHTMLTable %>%
extract2(4)
remDrv$quit()
remDrv$closeServer()
head(table)
# V1 V2 V3
# 1 SI No. Size of Holding(in ha.) Institutional Holdings
# 2 (1) (2) (3)
# 3 1 MARGINAL 0
# 4 2 SMALL 0
# 5 3 SEMIMEDIUM 0
# 6 4 MEDIUM 0
但是,上面的静态解决方案只回答了您的部分问题,即如何使用 R 填写网络表单。
您网页上的棘手之处在于不同下拉菜单中的值相互依赖。
下面,您将找到一个解决方案,它考虑了这些依赖关系,而不需要您预先知道各自的地区和 tehsils ID。
下面的代码为
下载数据
- 州 = GOA
- 表格 = 按规模组划分的运营控股的平均规模
包括所有地区和所有 tehsils。我使用 GOA 作为主要锚点,但您可以轻松地 select 选择其他状态。
library(RSelenium)
library(XML)
library(dplyr)
library(magrittr)
# Start Selenium Server --------------------------------------------------------
checkForServer()
startServer()
remDrv <- remoteDriver()
remDrv$open()
# Simulate browser session and fill out form -----------------------------------
remDrv$navigate('http://agcensus.dacnet.nic.in/tehsilsummarytype.aspx')
# Select 27a == GOA as the anchor
remDrv$findElement(using = "xpath",
"//select[@name = '_ctl0:ContentPlaceHolder1:DropDownList1']/option[@value = '27a']")$clickElement()
# Select 4 == Average Size of Operational Holding by Size Group
remDrv$findElement(using = "xpath",
"//select[@name = '_ctl0:ContentPlaceHolder1:DropDownList3']/option[@value = '4']")$clickElement()
# Get all district IDs and the respective names belonging to GOA
district_IDs <- remDrv$findElements(using = "xpath",
"//select[@name = '_ctl0:ContentPlaceHolder1:DropDownList9']/option") %>%
lapply(function(x){x$getElementAttribute('value')}) %>%
unlist
district_names <- remDrv$findElements(using = "xpath",
"//select[@name = '_ctl0:ContentPlaceHolder1:DropDownList9']/option") %>%
lapply(function(x){x$getElementText()}) %>%
unlist
# Retrieve and download results ------------------------------------------------
result <- data.frame(district = character(), teshil = character(),
V1 = character(), V2 = character(), V3 = character())
for (i in seq_along(district_IDs)) {
remDrv$findElement(using = "xpath",
paste0("//select[@name = '_ctl0:ContentPlaceHolder1:DropDownList9']/option[@value = ",
"'", district_IDs[i], "']"))$clickElement()
Sys.sleep(2)
# Get all tehsil IDs and names from the currently selected district
tehsil_IDs <- remDrv$findElements(using = "xpath",
"//div[@id = '_ctl0_ContentPlaceHolder1_Panel4']/select/option") %>%
lapply(function(x){x$getElementAttribute('value')}) %>%
unlist
tehsil_names <- remDrv$findElements(using = "xpath",
"//div[@id = '_ctl0_ContentPlaceHolder1_Panel4']/select/option") %>%
lapply(function(x){x$getElementText()}) %>%
unlist
for (j in seq_along(tehsil_IDs)) {
remDrv$findElement(using = "xpath",
paste0("//div[@id = '_ctl0_ContentPlaceHolder1_Panel4']/select/option[@value = ",
"'", tehsil_IDs[j], "']"))$clickElement()
Sys.sleep(2)
# Click submit and download data of the selected tehsil
remDrv$findElement(using = "xpath",
"//input[@value = 'Submit']")$clickElement()
Sys.sleep(2)
# Download data for current tehsil
tehsil_data <- remDrv$getPageSource()[[1]] %>%
htmlParse %>%
readHTMLTable %>%
extract2(4) %>%
extract(c(-1, -2), )
result <- data.frame(district = district_names[i], tehsil = tehsil_names[j],
tehsil_data) %>% rbind(result, .)
remDrv$goBack()
Sys.sleep(2)
}
}
remDrv$quit()
remDrv$closeServer()
result %<>% as_data_frame %>%
rename(
si_no = V1,
holding_size = V2,
inst_holdings = V3
) %>%
mutate(
si_no = as.numeric(as.character(si_no)),
inst_holdings = as.numeric(as.character(inst_holdings))
)
dim(result)
# [1] 66 5
head(result)
# district tehsil si_no holding_size inst_holdings
# 1 NORTH GOA ponda 1 MARGINAL 0.34
# 2 NORTH GOA ponda 2 SMALL 0.00
# 3 NORTH GOA ponda 3 SEMIMEDIUM 2.50
# 4 NORTH GOA ponda 4 MEDIUM 0.00
# 5 NORTH GOA ponda 5 LARGE 182.64
# 6 NORTH GOA ponda 6 ALL SIZE CLASS 41.09
tail(result)
# district tehsil si_no holding_size inst_holdings
# 1 SOUTH GOA quepem 1 MARGINAL 0.30
# 2 SOUTH GOA quepem 2 SMALL 0.00
# 3 SOUTH GOA quepem 3 SEMIMEDIUM 0.00
# 4 SOUTH GOA quepem 4 MEDIUM 0.00
# 5 SOUTH GOA quepem 5 LARGE 23.50
# 6 SOUTH GOA quepem 6 ALL SIZE CLASS 15.77
RSelenium 甚至支持利用 PhantomJS 的无头浏览,如 vignette.
中所述
我正试图在 R 中完成以下任务。 本网站提供有关印度农业数据的分区级统计数据(在 table 秒内):http://agcensus.dacnet.nic.in/tehsilsummarytype.aspx 我的理解是,这称为动态表单,因为选项会根据所做的条目而变化。具体来说,我想下载 tables 用于:
- 邦=安得拉邦
- 地区 = 阿迪拉巴德、阿南塔普尔、卡达帕、...(共 8 个)
- Tahsil = Mancherial, Kasipet(这些只是 District = Adilabad 的两个例子;总共 158 个)
然后我想要 "Average Holdings by size" 所有社会群体,所有性别和总数。
基于此 post What if I want to web scrape with R for a page with parameters? 我认为要走的路是使用 getHTMLFormDescription()。但是,由于我的表单是动态的,所以我无法遵循另一个 post 中建议的路线。 createFunction() 行 returns 一个错误: “* writeFunction(formDescription, character(), url, con, verbose = verbose, 错误:您应该在此处提供表单描述。请参阅 getFormDescription()。”
在omegahat网页可以下载的RHTMLForms包中, 有这个功能(顾名思义)应该做我需要的:
function function(desc, omit = character(), drop = TRUE, ..., verbose = FALSE) {
# Discard the elements that we are omitting.
if(length(omit)) {
idx = match(omit, names(desc$elements), 0)
k = class(desc$elements)
desc$elements <- desc$elements[-idx]
class(desc$elements) = k }
# If no more elements left as a result of omitting them, just return the description
# as there are definitely no more dynamic components left.
if(length(desc$elements) == 0)
return(desc)
# Now find the dynamic components.
dyn = sapply(desc$elements, inherits, "DynamicHTMLFormElement")
if(!any(dyn))
return(desc)
pivot = desc$elements[[min(which(dyn))]]
# We will need to submit the form for each value of this dynamic element, so
# get the URI. If the URI changes depending on the value, we are out of luck!!
url = mergeURI(URI(desc$formAttributes["action"]), URI(desc$url))
# Prepare the return value with the pivot information and we will build up
# the branches by looping over the possible values.
descriptions = list(elementName = pivot$name,
description = pivot,
values = list())
omit = c(omit, pivot$name)
for(i in names(pivot$options)) {
# Create the arguments for the submission. We may need to include them all.
args = list(i)
names(args)[1] = pivot$name
if(verbose)
cat("Checking ", pivot$name, " - option", i, "\n")
#XX we may need to provide all the arguments rather than just this one.
# or perhaps cumulate them for the elements we have already deal with.
# We have the defaults and the possible values from the original description.
page = formQuery(args, toString(url), desc, .checkArgs = FALSE, ...)
# Make certain that we turn the checkDynamic off here to avoid recursively.
tmp = getHTMLFormDescription(page, asText = TRUE, handlers = multiFormElementHandlers(url, checkDynamic = FALSE))
tmp = getDynamicHTMLFormDescription(tmp, omit = omit)
# Now remove the elements that we are omitting. This leaves a subset of the form.
if(drop) {
idx = match(omit, names(tmp$elements), 0)
if(any(is.na(idx))) {
k = class(tmp$elements)
tmp$elements = tmp$elements[is.na(idx)]
class(tmp$elements) = k
}
class(tmp) <- c("HTMLFormSubset", class(tmp))
}
descriptions$values[[i]] = tmp
}
class(descriptions) <- c("DynamicFormElementPath")
descriptions
}
但是,我也无法让它工作 - 调用 getDynamicHTMLFormDescription("http://agcensus.dacnet.nic.in/tehsilsummarytype.aspx") 给出 "Error in desc$elements : $ operator is invalid for atomic vectors".
有人对如何解决这个问题有什么建议吗?一旦我有办法填写表格并访问每个分区 (tahsil) 的 table,我就知道如何整理数据。这实际上只是让 R 填写这个(特定的)表格。
欢迎任何帮助! 迈克尔·凯撒 (加州大学圣地亚哥分校助理研究员)
这是使用 RSelenium
为
- 邦=安得拉邦
- 地区 = 阿迪拉巴特
- 泰西尔 = Mancherial
- 表格 = 按规模组划分的运营控股的平均规模
其余字段使用默认输入参数。
library(RSelenium)
library(XML)
library(magrittr)
# Start Selenium Server --------------------------------------------------------
checkForServer()
startServer()
remDrv <- remoteDriver()
remDrv$open()
# Simulate browser session and fill out form -----------------------------------
remDrv$navigate('http://agcensus.dacnet.nic.in/tehsilsummarytype.aspx')
remDrv$findElement(using = "xpath",
"//select[@name = '_ctl0:ContentPlaceHolder1:DropDownList1']/option[@value = '1a']")$clickElement()
remDrv$findElement(using = "xpath",
"//select[@name = '_ctl0:ContentPlaceHolder1:DropDownList9']/option[@value = '19']")$clickElement()
remDrv$findElement(using = "xpath",
"//option[@value = '33']")$clickElement()
remDrv$findElement(using = "xpath",
"//select[@name = '_ctl0:ContentPlaceHolder1:DropDownList3']/option[@value = '4']")$clickElement()
# Click submit
remDrv$findElement(using = "xpath",
"//input[@value = 'Submit']")$clickElement()
# Retrieve and download results ------------------------------------------------
table <- remDrv$getPageSource()[[1]] %>%
htmlParse %>%
readHTMLTable %>%
extract2(4)
remDrv$quit()
remDrv$closeServer()
head(table)
# V1 V2 V3
# 1 SI No. Size of Holding(in ha.) Institutional Holdings
# 2 (1) (2) (3)
# 3 1 MARGINAL 0
# 4 2 SMALL 0
# 5 3 SEMIMEDIUM 0
# 6 4 MEDIUM 0
但是,上面的静态解决方案只回答了您的部分问题,即如何使用 R 填写网络表单。
您网页上的棘手之处在于不同下拉菜单中的值相互依赖。
下面,您将找到一个解决方案,它考虑了这些依赖关系,而不需要您预先知道各自的地区和 tehsils ID。
下面的代码为
下载数据- 州 = GOA
- 表格 = 按规模组划分的运营控股的平均规模
包括所有地区和所有 tehsils。我使用 GOA 作为主要锚点,但您可以轻松地 select 选择其他状态。
library(RSelenium)
library(XML)
library(dplyr)
library(magrittr)
# Start Selenium Server --------------------------------------------------------
checkForServer()
startServer()
remDrv <- remoteDriver()
remDrv$open()
# Simulate browser session and fill out form -----------------------------------
remDrv$navigate('http://agcensus.dacnet.nic.in/tehsilsummarytype.aspx')
# Select 27a == GOA as the anchor
remDrv$findElement(using = "xpath",
"//select[@name = '_ctl0:ContentPlaceHolder1:DropDownList1']/option[@value = '27a']")$clickElement()
# Select 4 == Average Size of Operational Holding by Size Group
remDrv$findElement(using = "xpath",
"//select[@name = '_ctl0:ContentPlaceHolder1:DropDownList3']/option[@value = '4']")$clickElement()
# Get all district IDs and the respective names belonging to GOA
district_IDs <- remDrv$findElements(using = "xpath",
"//select[@name = '_ctl0:ContentPlaceHolder1:DropDownList9']/option") %>%
lapply(function(x){x$getElementAttribute('value')}) %>%
unlist
district_names <- remDrv$findElements(using = "xpath",
"//select[@name = '_ctl0:ContentPlaceHolder1:DropDownList9']/option") %>%
lapply(function(x){x$getElementText()}) %>%
unlist
# Retrieve and download results ------------------------------------------------
result <- data.frame(district = character(), teshil = character(),
V1 = character(), V2 = character(), V3 = character())
for (i in seq_along(district_IDs)) {
remDrv$findElement(using = "xpath",
paste0("//select[@name = '_ctl0:ContentPlaceHolder1:DropDownList9']/option[@value = ",
"'", district_IDs[i], "']"))$clickElement()
Sys.sleep(2)
# Get all tehsil IDs and names from the currently selected district
tehsil_IDs <- remDrv$findElements(using = "xpath",
"//div[@id = '_ctl0_ContentPlaceHolder1_Panel4']/select/option") %>%
lapply(function(x){x$getElementAttribute('value')}) %>%
unlist
tehsil_names <- remDrv$findElements(using = "xpath",
"//div[@id = '_ctl0_ContentPlaceHolder1_Panel4']/select/option") %>%
lapply(function(x){x$getElementText()}) %>%
unlist
for (j in seq_along(tehsil_IDs)) {
remDrv$findElement(using = "xpath",
paste0("//div[@id = '_ctl0_ContentPlaceHolder1_Panel4']/select/option[@value = ",
"'", tehsil_IDs[j], "']"))$clickElement()
Sys.sleep(2)
# Click submit and download data of the selected tehsil
remDrv$findElement(using = "xpath",
"//input[@value = 'Submit']")$clickElement()
Sys.sleep(2)
# Download data for current tehsil
tehsil_data <- remDrv$getPageSource()[[1]] %>%
htmlParse %>%
readHTMLTable %>%
extract2(4) %>%
extract(c(-1, -2), )
result <- data.frame(district = district_names[i], tehsil = tehsil_names[j],
tehsil_data) %>% rbind(result, .)
remDrv$goBack()
Sys.sleep(2)
}
}
remDrv$quit()
remDrv$closeServer()
result %<>% as_data_frame %>%
rename(
si_no = V1,
holding_size = V2,
inst_holdings = V3
) %>%
mutate(
si_no = as.numeric(as.character(si_no)),
inst_holdings = as.numeric(as.character(inst_holdings))
)
dim(result)
# [1] 66 5
head(result)
# district tehsil si_no holding_size inst_holdings
# 1 NORTH GOA ponda 1 MARGINAL 0.34
# 2 NORTH GOA ponda 2 SMALL 0.00
# 3 NORTH GOA ponda 3 SEMIMEDIUM 2.50
# 4 NORTH GOA ponda 4 MEDIUM 0.00
# 5 NORTH GOA ponda 5 LARGE 182.64
# 6 NORTH GOA ponda 6 ALL SIZE CLASS 41.09
tail(result)
# district tehsil si_no holding_size inst_holdings
# 1 SOUTH GOA quepem 1 MARGINAL 0.30
# 2 SOUTH GOA quepem 2 SMALL 0.00
# 3 SOUTH GOA quepem 3 SEMIMEDIUM 0.00
# 4 SOUTH GOA quepem 4 MEDIUM 0.00
# 5 SOUTH GOA quepem 5 LARGE 23.50
# 6 SOUTH GOA quepem 6 ALL SIZE CLASS 15.77
RSelenium 甚至支持利用 PhantomJS 的无头浏览,如 vignette.
中所述