R: Web scraping .aspx form with hidden fields, "Unknown field names" error
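For two days I have been trying to work out how to fill in and submit a form to download a .csv file from https://www.igb.illinois.gov/VideoReports.aspx, and I can't seem to crack it. Full disclosure: I'm new to web scraping. I can do basic scraping, but this is new territory for me. Eventually I hope to write a program that pulls the monthly revenue reports for all establishments going back to September 2009.
The main problem seems to be the way the form is laid out. I can't figure out how to specify the fields I want to fill in to request the .csv file. I have been using rvest and RHTMLForms. I found the form in Chrome's developer tools and can see everything I need there, but I can't drill down to the point where I can submit the query.
Here is what I have so far: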
library('rvest')
library('RHTMLForms')
igb <- "https://www.igb.illinois.gov/VideoReports.aspx"
igb_html <- read_html(igb)
igbForm <- html_form(igb_html)
igbForm
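The problem seems to start here. The "form" has only one element, and it begins with the hidden inputs. The fields I want to query are near the end. It looks like this...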
[[1]]
<form> 'aspnetForm' (POST VideoReports.aspx)
<input hidden> '__VIEWSTATE': /wEPDwUKMTU1MTExNzA3NQ9kFgJmD2QWAgIDD2QWAgIBD2QWBAIBD2QWEgIDDw8WAh4EVGV4dAUOU2VwdGVtYmVyIDIwMTJkZAIFDw8WAh8ABQ1GZWJydWFyeSAyMDIwZGQCFQ9kFgICAw8QZBAVAg5TdW1tYXJ5IHJlcG9ydA1EZXRhaWwgcmVwb3J0FQIOU3VtbWFyeSByZXBvcnQNRGV0YWlsIHJlcG9ydBQrAwJnZ2RkAhcPZBYCAgMPEA8WBh4ORGF0YVZhbHVlRmllbGQFA0tleR4NRGF0YVRleHRGaWVsZAUFVmFsdWUeC18 ....[TRUNCATE]
At the very end come the fields I want to query...
<input radio> 'ctl00$MainPlaceHolder$SearchType': TypeStatewide
<input radio> 'ctl00$MainPlaceHolder$SearchType': TypeMuni
<input radio> 'ctl00$MainPlaceHolder$SearchType': TypeEst
<select> 'ctl00$MainPlaceHolder$SearchStateType' [1/2]
<select> 'ctl00$MainPlaceHolder$SearchMunicipality' [0/1069]
<select> 'ctl00$MainPlaceHolder$SearchEstablishment' [0/10182]
<input text> 'ctl00$MainPlaceHolder$SearchLicenseNumber':
<select> 'ctl00$MainPlaceHolder$SearchStartMonth' [1/12]
<select> 'ctl00$MainPlaceHolder$SearchStartYear' [1/9]
<select> 'ctl00$MainPlaceHolder$SearchEndMonth' [1/12]
<select> 'ctl00$MainPlaceHolder$SearchEndYear' [1/9]
<input radio> 'ctl00$MainPlaceHolder$ViewType': ViewPDF
<input radio> 'ctl00$MainPlaceHolder$ViewType': ViewCSV
I used the following to get to what I needed...
igb_form <- getHTMLFormDescription(igb_html)
igb_form[[1]]
... along with code to locate each field and its values. For example...
igb_form_att <- igb_form[[1]]
igb_form_att$elements[[9]]
... shows the start-month field and the values in its drop-down menu...
ctl00$MainPlaceHolder$SearchStartMonth: [ February ] January, February, March, April, May, June, July, August, September, October, November, December
I thought that would do it, so I ran the following...
igb_fill <- set_values(igb_html,
'ctl00$MainPlaceHolder$SearchType' = 'TypeEst',
'ctl00$MainPlaceHolder$SearchEstablishment'='All Establishments',
'ctl00$MainPlaceHolder$SearchEstablishment' ='',
'ctl00$MainPlaceHolder$SearchStartMonth'='September',
'ctl00$MainPlaceHolder$SearchStartYear'='2009',
'ctl00$MainPlaceHolder$SearchEndMonth' ='February',
'ctl00$MainPlaceHolder$SearchEndYear'='2020',
'ctl00$MainPlaceHolder$ViewType'='ViewCSV')
submit_form(session=igb_html, form=igb_fill, POST(igb))
But I received this error...
Error: Unknown field names: ctl00$MainPlaceHolder$SearchType, ctl00$MainPlaceHolder$SearchEstablishment, ctl00$MainPlaceHolder$SearchStartMonth, ctl00$MainPlaceHolder$SearchStartYear, ctl00$MainPlaceHolder$SearchEndMonth, ctl00$MainPlaceHolder$SearchEndYear, ctl00$MainPlaceHolder$ViewType
Traceback:
1. set_values(igb_form, `ctl00$MainPlaceHolder$SearchType` = "TypeEst",
. `ctl00$MainPlaceHolder$SearchEstablishment` = "All Establishments",
. `ctl00$MainPlaceHolder$SearchEstablishment` = "", `ctl00$MainPlaceHolder$SearchStartMonth` = "September",
. `ctl00$MainPlaceHolder$SearchStartYear` = "2009", `ctl00$MainPlaceHolder$SearchEndMonth` = "February",
. `ctl00$MainPlaceHolder$SearchEndYear` = "2020", `ctl00$MainPlaceHolder$ViewType` = "ViewCSV")
2. stop("Unknown field names: ", paste(no_match, collapse = ", "),
. call. = FALSE)
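Apologies for the lengthy question, but I have dug into this a lot and can't find an answer that gets me where I need to go. Maybe I'm in over my head, but I would appreciate any help! (I'm also fairly sure the submit code is wrong, but I can sort that out after this.)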
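There are a couple of problems with your code:
set_values(...) takes a form, not the whole HTML document, so I replaced igb_html with igb_form. Anything else has no form fields for it to match the names against, which is why every name was reported as unknown.
submit_form(...) needs an html_session, so I replaced read_html(igb) with html_session(igb).
The following code should work: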
library(rvest)

igb <- "https://www.igb.illinois.gov/VideoReports.aspx"

# Open a session rather than just parsing the page, so the form can be submitted
igb_html <- html_session(igb)

# html_form() returns a list of forms; this page has one, so take the first
igb_form <- html_form(igb_html)[[1]]

# Set the fields by name on the form object (values as in the question)
igb_fill <- set_values(igb_form,
  'ctl00$MainPlaceHolder$SearchType' = 'TypeEst',
  'ctl00$MainPlaceHolder$SearchEstablishment' = 'All Establishments',
  'ctl00$MainPlaceHolder$SearchEstablishment' = '',
  'ctl00$MainPlaceHolder$SearchStartMonth' = 'September',
  'ctl00$MainPlaceHolder$SearchStartYear' = '2009',
  'ctl00$MainPlaceHolder$SearchEndMonth' = 'February',
  'ctl00$MainPlaceHolder$SearchEndYear' = '2020',
  'ctl00$MainPlaceHolder$ViewType' = 'ViewCSV')

# Submit through the session, naming the search button as the submit control
igb_html <- submit_form(igb_html, igb_fill, submit = "ctl00$MainPlaceHolder$ButtonSearch")
igb_html
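To see the first point in isolation, here is a minimal offline sketch. It assumes a made-up stand-in page (the markup below is hypothetical and much smaller than the real IGB form) and reuses only a few field names from the question. It keeps the same function names as the code above; on rvest 1.0.0 or later set_values() still runs but emits a deprecation warning, since it was renamed to html_form_set().

```r
library(rvest)

# Hypothetical stand-in for the real page: one form with a hidden input,
# a text input and a submit button.
page <- read_html('
  <form id="aspnetForm" method="post" action="VideoReports.aspx">
    <input type="hidden" name="__VIEWSTATE" value="abc">
    <input type="text" name="ctl00$MainPlaceHolder$SearchLicenseNumber" value="">
    <input type="submit" name="ctl00$MainPlaceHolder$ButtonSearch" value="Search">
  </form>')

# html_form() returns a list of forms, even when the page has only one.
forms <- html_form(page)

# set_values() wants a single form object, so pick it out of the list.
form <- forms[[1]]

# These are the names set_values() will accept for this form.
names(form$fields)

# Setting a field on the form object works; passing `page` here would not.
filled <- set_values(form, 'ctl00$MainPlaceHolder$SearchLicenseNumber' = '123456')
filled$fields[['ctl00$MainPlaceHolder$SearchLicenseNumber']]$value
```

Checking names(form$fields) before calling set_values() is a quick way to confirm that you are holding a form and that the names you plan to set are spelled the way the page spells them.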