Search a website for a phrase in R
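I want to understand which machine learning applications the US federal government is developing. The federal government maintains the website FedBizOps, which contains contracts. The site can be searched for a phrase, e.g. "machine learning", and a date range, e.g. "last 365 days", to find relevant contracts. The resulting search produces links that contain a contract summary.
I would like to be able to pull the contract summaries from this site, given a search term and a date range.
Is there any way to scrape the browser-rendered data into R? A similar question on web scraping exists, but I don't know how to change the date range.
Once the information is pulled into R, I would like to organize the summaries with a bubble chart of key phrases.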
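This may look like a site that uses XHR via JavaScript to retrieve the URL contents, but it is not. It is just a plain website that can easily be scraped with standard rvest and xml2 calls such as html_session and read_html. It does keep the Location: URL the same, so it looks a bit like XHR even though it is not.
However, this is a <form>-based site, which means you could be generous to the community, write an R wrapper for the "hidden" API, and possibly donate it to rOpenSci.
To that end, I used the curlconverter package on the "Copy as cURL" content from the POST request, and it provided all the form fields (which seem to map to most, if not all, of the fields on the advanced search page):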
library(curlconverter)
make_req(straighten())[[1]] -> req
httr::VERB(verb = "POST", url = "https://www.fbo.gov/index?s=opportunity&mode=list&tab=list",
httr::add_headers(Pragma = "no-cache",
Origin = "https://www.fbo.gov",
`Accept-Encoding` = "gzip, deflate, br",
`Accept-Language` = "en-US,en;q=0.8",
`Upgrade-Insecure-Requests` = "1",
`User-Agent` = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.41 Safari/537.36",
Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
`Cache-Control` = "no-cache",
Referer = "https://www.fbo.gov/index?s=opportunity&mode=list&tab=list",
Connection = "keep-alive",
DNT = "1"), httr::set_cookies(PHPSESSID = "32efd3be67d43758adcc891c6f6814c4",
sympcsm_cookies_enabled = "1",
BALANCEID = "balancer.172.16.121.7"),
body = list(`dnf_class_values[procurement_notice][keywords]` = "machine+learning",
`dnf_class_values[procurement_notice][_posted_date]` = "365",
search_filters = "search",
`_____dummy` = "dnf_",
so_form_prefix = "dnf_",
dnf_opt_action = "search",
dnf_opt_template = "VVY2VDwtojnPpnGoobtUdzXxVYcDLoQW1MDkvvEnorFrm5k54q2OU09aaqzsSe6m",
dnf_opt_template_dir = "Pje8OihulaLVPaQ+C+xSxrG6WrxuiBuGRpBBjyvqt1KAkN/anUTlMWIUZ8ga9kY+",
dnf_opt_subform_template = "qNIkz4cr9hY8zJ01/MDSEGF719zd85B9",
dnf_opt_finalize = "0",
dnf_opt_mode = "update",
dnf_opt_target = "", dnf_opt_validate = "1",
`dnf_class_values[procurement_notice][dnf_class_name]` = "procurement_notice",
`dnf_class_values[procurement_notice][notice_id]` = "63ae1a97e9a5a9618fd541d900762e32",
`dnf_class_values[procurement_notice][posted]` = "",
`autocomplete_input_dnf_class_values[procurement_notice][agency]` = "",
`dnf_class_values[procurement_notice][agency]` = "",
`dnf_class_values[procurement_notice][zipstate]` = "",
`dnf_class_values[procurement_notice][procurement_type][]` = "",
`dnf_class_values[procurement_notice][set_aside][]` = "",
mode = "list"), encode = "form")
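curlconverter adds the httr:: prefixes to the various functions because you can actually call req() to make the request. It is a bona fide R function.
However, most of the data being passed in is browser "cruft"; it can be trimmed down a bit and moved into a POST request: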
library(httr)
library(rvest)
POST(url = "https://www.fbo.gov/index?s=opportunity&mode=list&tab=list",
add_headers(Origin = "https://www.fbo.gov",
Referer = "https://www.fbo.gov/index?s=opportunity&mode=list&tab=list"),
set_cookies(PHPSESSID = "32efd3be67d43758adcc891c6f6814c4",
sympcsm_cookies_enabled = "1",
BALANCEID = "balancer.172.16.121.7"),
body = list(`dnf_class_values[procurement_notice][keywords]` = "machine+learning",
`dnf_class_values[procurement_notice][_posted_date]` = "365",
search_filters = "search",
`_____dummy` = "dnf_",
so_form_prefix = "dnf_",
dnf_opt_action = "search",
dnf_opt_template = "VVY2VDwtojnPpnGoobtUdzXxVYcDLoQW1MDkvvEnorFrm5k54q2OU09aaqzsSe6m",
dnf_opt_template_dir = "Pje8OihulaLVPaQ+C+xSxrG6WrxuiBuGRpBBjyvqt1KAkN/anUTlMWIUZ8ga9kY+",
dnf_opt_subform_template = "qNIkz4cr9hY8zJ01/MDSEGF719zd85B9",
dnf_opt_finalize = "0",
dnf_opt_mode = "update",
dnf_opt_target = "", dnf_opt_validate = "1",
`dnf_class_values[procurement_notice][dnf_class_name]` = "procurement_notice",
`dnf_class_values[procurement_notice][notice_id]` = "63ae1a97e9a5a9618fd541d900762e32",
`dnf_class_values[procurement_notice][posted]` = "",
`autocomplete_input_dnf_class_values[procurement_notice][agency]` = "",
`dnf_class_values[procurement_notice][agency]` = "",
`dnf_class_values[procurement_notice][zipstate]` = "",
`dnf_class_values[procurement_notice][procurement_type][]` = "",
`dnf_class_values[procurement_notice][set_aside][]` = "",
mode="list"),
encode = "form") -> res
This portion:
set_cookies(PHPSESSID = "32efd3be67d43758adcc891c6f6814c4",
sympcsm_cookies_enabled = "1",
BALANCEID = "balancer.172.16.121.7")
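makes me think you should use html_session or GET at least once on the main URL to establish those cookies in the cached curl handle (which is created and maintained for you automatically).
The add_headers() bit may not be necessary either, but that is an exercise left to the reader.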
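As a rough sketch of the wrapper idea and the cookie suggestion above, the search could be parametrized along these lines. The function names fbo_search_body and fbo_search are made up for illustration. The field names and token values are copied from the captured request; it is an assumption that the server still accepts them, that the _posted_date field takes other day counts the same way it takes "365", and that a preliminary GET is enough to set the session cookies.

```r
# Hypothetical helper: build the form body for a keyword + date-range search.
# All field names and token strings are copied verbatim from the captured request.
fbo_search_body <- function(keywords, days = 365) {
  stopifnot(is.character(keywords), length(keywords) == 1)
  list(
    `dnf_class_values[procurement_notice][keywords]` = keywords,
    `dnf_class_values[procurement_notice][_posted_date]` = as.character(days),
    search_filters = "search",
    `_____dummy` = "dnf_",
    so_form_prefix = "dnf_",
    dnf_opt_action = "search",
    dnf_opt_template = "VVY2VDwtojnPpnGoobtUdzXxVYcDLoQW1MDkvvEnorFrm5k54q2OU09aaqzsSe6m",
    dnf_opt_template_dir = "Pje8OihulaLVPaQ+C+xSxrG6WrxuiBuGRpBBjyvqt1KAkN/anUTlMWIUZ8ga9kY+",
    dnf_opt_subform_template = "qNIkz4cr9hY8zJ01/MDSEGF719zd85B9",
    dnf_opt_finalize = "0",
    dnf_opt_mode = "update",
    dnf_opt_target = "",
    dnf_opt_validate = "1",
    `dnf_class_values[procurement_notice][dnf_class_name]` = "procurement_notice",
    `dnf_class_values[procurement_notice][notice_id]` = "63ae1a97e9a5a9618fd541d900762e32",
    `dnf_class_values[procurement_notice][posted]` = "",
    `autocomplete_input_dnf_class_values[procurement_notice][agency]` = "",
    `dnf_class_values[procurement_notice][agency]` = "",
    `dnf_class_values[procurement_notice][zipstate]` = "",
    `dnf_class_values[procurement_notice][procurement_type][]` = "",
    `dnf_class_values[procurement_notice][set_aside][]` = "",
    mode = "list"
  )
}

# Hypothetical wrapper: GET once so the server can set its cookies, then POST.
# httr reuses one curl handle per host, so cookies from the GET are sent again
# with the POST without an explicit set_cookies() call.
fbo_search <- function(keywords, days = 365,
                       url = "https://www.fbo.gov/index?s=opportunity&mode=list&tab=list") {
  httr::GET(url)
  httr::POST(url, body = fbo_search_body(keywords, days), encode = "form")
}

# Usage (not run here, needs network access):
# res <- fbo_search("machine+learning", days = 365)
```

Keeping the body builder separate from the request makes it easy to inspect or adjust the form fields before anything is sent.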
You can find the table you're looking for via:
content(res, as="text", encoding="UTF-8") %>%
read_html() %>%
html_nodes("table.list") %>%
html_table() %>%
dplyr::glimpse()
## Observations: 20
## Variables: 4
## $ Opportunity <chr> "NSN: 1650-01-074-1054; FILTER ELEMENT, FLUID; WSIC: L SP...
## $ Agency/Office/Location <chr> "Defense Logistics Agency DLA Acquisition LocationsDLA Av...
## $ Type / Set-aside <chr> "Presolicitation", "Presolicitation", "Award", "Award", "...
## $ Posted On <chr> "Sep 28, 2016", "Sep 28, 2016", "Sep 28, 2016", "Sep 28, ...
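Since html_table() keeps only the cell text, the links to the individual notices have to be pulled separately. The following is a minimal sketch, assuming each row of table.list contains an anchor pointing at the notice page; the HTML snippet and its id values are made-up placeholders, not real markup from the site.

```r
library(rvest)

# Placeholder markup standing in for the real results page (structure assumed).
page <- read_html('<table class="list">
  <tr><th>Opportunity</th><th>Posted On</th></tr>
  <tr><td><a href="index?s=opportunity&amp;mode=form&amp;id=abc">Notice A</a></td><td>Sep 28, 2016</td></tr>
  <tr><td><a href="index?s=opportunity&amp;mode=form&amp;id=def">Notice B</a></td><td>Sep 28, 2016</td></tr>
</table>')

# Collect the href of every anchor inside the results table and
# resolve relative links against the site root.
links <- page %>%
  html_nodes("table.list a") %>%
  html_attr("href") %>%
  xml2::url_absolute("https://www.fbo.gov/")

links
```

With a real response, page would instead come from read_html(content(res, as = "text", encoding = "UTF-8")), and each link could then be fetched in turn to read the summary.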
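An indicator on the page says these are results "1 - 20 of 2008". You will need to scrape that as well and deal with the paginated results. This is also left as an exercise for the reader.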
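As a starting point for that exercise, the indicator text can be turned into a page count with base R. This is only a sketch: parse_result_range is a hypothetical helper, it assumes the indicator always has the form "a - b of n" and that it is read from the first page of results, and it does not cover locating the indicator in the HTML or requesting the later pages.

```r
# Hypothetical helper: parse an indicator such as "1 - 20 of 2008".
parse_result_range <- function(indicator) {
  pattern <- "([0-9]+) *- *([0-9]+) +of +([0-9]+)"
  m <- regmatches(indicator, regexec(pattern, indicator))[[1]]
  if (length(m) != 4) stop("no 'a - b of n' pattern found in: ", indicator)
  n <- as.integer(m[2:4])
  # Page size is inferred from the first page's range (e.g. 1 - 20 gives 20).
  per_page <- n[2] - n[1] + 1L
  list(first = n[1], last = n[2], total = n[3],
       pages = as.integer(ceiling(n[3] / per_page)))
}

parse_result_range("1 - 20 of 2008")$pages
# [1] 101
```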