Search website for phrase in R

I want to learn what machine learning applications the U.S. federal government is developing. The federal government maintains FedBizOps, a website of contract opportunities. The site can be searched for a phrase, e.g. "machine learning", together with a date range, e.g. "last 365 days", to find relevant contracts. The resulting search produces links to contract summaries.

I would like to be able to pull the contract summaries from this site, given a search phrase and a date range.

Is there a way to scrape the browser-rendered data into R? A similar question on web scraping exists, but I don't know how to change the date range with that approach.

Once the information is in R, I would like to organize the summaries with a bubble chart of key phrases.

This looks like a site that uses XHR via javascript to retrieve URL contents, but it doesn't. It's just a plain web site that can be scraped easily with standard rvest/xml2 calls such as html_session() or read_html(). It keeps the Location: URL the same, so it just looks a bit like XHR, even though it isn't.

However, this is a <form>-based site, which means you could be generous to the community and write an R wrapper around the "hidden" API, perhaps even donating it to rOpenSci.

To that end, I used the curlconverter package on the "Copy as cURL" content of the POST request, and it provided all of the form fields (they appear to map to most, if not all, of the fields on the advanced search page):

library(curlconverter)

make_req(straighten())[[1]] -> req

httr::VERB(verb = "POST", url = "https://www.fbo.gov/index?s=opportunity&mode=list&tab=list", 
    httr::add_headers(Pragma = "no-cache", 
        Origin = "https://www.fbo.gov", 
        `Accept-Encoding` = "gzip, deflate, br", 
        `Accept-Language` = "en-US,en;q=0.8", 
        `Upgrade-Insecure-Requests` = "1", 
        `User-Agent` = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.41 Safari/537.36", 
        Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8", 
        `Cache-Control` = "no-cache", 
        Referer = "https://www.fbo.gov/index?s=opportunity&mode=list&tab=list", 
        Connection = "keep-alive", 
        DNT = "1"), httr::set_cookies(PHPSESSID = "32efd3be67d43758adcc891c6f6814c4", 
        sympcsm_cookies_enabled = "1", 
        BALANCEID = "balancer.172.16.121.7"), 
    body = list(`dnf_class_values[procurement_notice][keywords]` = "machine+learning", 
        `dnf_class_values[procurement_notice][_posted_date]` = "365", 
        search_filters = "search", 
        `_____dummy` = "dnf_", 
        so_form_prefix = "dnf_", 
        dnf_opt_action = "search", 
        dnf_opt_template = "VVY2VDwtojnPpnGoobtUdzXxVYcDLoQW1MDkvvEnorFrm5k54q2OU09aaqzsSe6m", 
        dnf_opt_template_dir = "Pje8OihulaLVPaQ+C+xSxrG6WrxuiBuGRpBBjyvqt1KAkN/anUTlMWIUZ8ga9kY+", 
        dnf_opt_subform_template = "qNIkz4cr9hY8zJ01/MDSEGF719zd85B9", 
        dnf_opt_finalize = "0", 
        dnf_opt_mode = "update", 
        dnf_opt_target = "", dnf_opt_validate = "1", 
        `dnf_class_values[procurement_notice][dnf_class_name]` = "procurement_notice", 
        `dnf_class_values[procurement_notice][notice_id]` = "63ae1a97e9a5a9618fd541d900762e32", 
        `dnf_class_values[procurement_notice][posted]` = "", 
        `autocomplete_input_dnf_class_values[procurement_notice][agency]` = "", 
        `dnf_class_values[procurement_notice][agency]` = "", 
        `dnf_class_values[procurement_notice][zipstate]` = "", 
        `dnf_class_values[procurement_notice][procurement_type][]` = "", 
        `dnf_class_values[procurement_notice][set_aside][]` = "", 
        mode = "list"), encode = "form")

curlconverter adds the httr:: prefix to the various functions because you can actually use req() to make the request; it's a real R function.

However, most of the data being passed in is browser "cruft" and can be slimmed down a bit into a direct POST request:

library(httr)
library(rvest)

POST(url = "https://www.fbo.gov/index?s=opportunity&mode=list&tab=list", 
     add_headers(Origin = "https://www.fbo.gov", 
                 Referer = "https://www.fbo.gov/index?s=opportunity&mode=list&tab=list"), 
     set_cookies(PHPSESSID = "32efd3be67d43758adcc891c6f6814c4", 
                 sympcsm_cookies_enabled = "1", 
                 BALANCEID = "balancer.172.16.121.7"), 
     body = list(`dnf_class_values[procurement_notice][keywords]` = "machine+learning", 
                 `dnf_class_values[procurement_notice][_posted_date]` = "365", 
                 search_filters = "search", 
                 `_____dummy` = "dnf_", 
                 so_form_prefix = "dnf_", 
                 dnf_opt_action = "search", 
                 dnf_opt_template = "VVY2VDwtojnPpnGoobtUdzXxVYcDLoQW1MDkvvEnorFrm5k54q2OU09aaqzsSe6m", 
                 dnf_opt_template_dir = "Pje8OihulaLVPaQ+C+xSxrG6WrxuiBuGRpBBjyvqt1KAkN/anUTlMWIUZ8ga9kY+", 
                 dnf_opt_subform_template = "qNIkz4cr9hY8zJ01/MDSEGF719zd85B9", 
                 dnf_opt_finalize = "0", 
                 dnf_opt_mode = "update", 
                 dnf_opt_target = "", dnf_opt_validate = "1", 
                 `dnf_class_values[procurement_notice][dnf_class_name]` = "procurement_notice", 
                 `dnf_class_values[procurement_notice][notice_id]` = "63ae1a97e9a5a9618fd541d900762e32", 
                 `dnf_class_values[procurement_notice][posted]` = "", 
                 `autocomplete_input_dnf_class_values[procurement_notice][agency]` = "", 
                 `dnf_class_values[procurement_notice][agency]` = "", 
                 `dnf_class_values[procurement_notice][zipstate]` = "", 
                 `dnf_class_values[procurement_notice][procurement_type][]` = "", 
                 `dnf_class_values[procurement_notice][set_aside][]` = "",
                 mode="list"), 
     encode = "form") -> res

This part:

     set_cookies(PHPSESSID = "32efd3be67d43758adcc891c6f6814c4", 
                 sympcsm_cookies_enabled = "1", 
                 BALANCEID = "balancer.172.16.121.7")

makes me think you should use html_session() or GET() on the main URL at least once to establish those cookies in the cached curl handler (they will then be created and maintained for you automatically).
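As a sketch of that idea (the URL is the one used above; the key point is only that a GET happens on the same handle before the POST, so treat this as an assumption about fbo.gov's cookie behavior rather than verified code):

```r
library(httr)

# Hit the site root once before POSTing. httr caches one curl handle (and its
# cookie jar) per host, so the PHPSESSID / BALANCEID cookies set by this
# response should be sent automatically with any later POST() to the same
# host, making the hard-coded set_cookies() call unnecessary.
seed_fbo_session <- function(url = "https://www.fbo.gov/index?s=opportunity&mode=list&tab=list") {
  invisible(GET(url))
}
```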

The add_headers() bit is probably not necessary either, but that's left as an exercise for the reader.

You can find the table you're looking for via:

content(res, as="text", encoding="UTF-8") %>% 
  read_html() %>% 
  html_nodes("table.list") %>% 
  html_table() %>% 
  dplyr::glimpse()
## Observations: 20
## Variables: 4
## $ Opportunity            <chr> "NSN: 1650-01-074-1054; FILTER ELEMENT, FLUID; WSIC: L SP...
## $ Agency/Office/Location <chr> "Defense Logistics Agency DLA Acquisition LocationsDLA Av...
## $ Type /  Set-aside      <chr> "Presolicitation", "Presolicitation", "Award", "Award", "...
## $ Posted On              <chr> "Sep 28, 2016", "Sep 28, 2016", "Sep 28, 2016", "Sep 28, ...
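From here the columns can be tidied as usual; for instance, the Posted On values parse with base R (assuming an English locale for the month abbreviations):

```r
# Convert the "Sep 28, 2016"-style strings from the Posted On column into
# Date objects so the results can be filtered or sorted by date.
posted <- as.Date("Sep 28, 2016", format = "%b %d, %Y")
posted
# [1] "2016-09-28"
```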

There is an indicator on the page that these are results "1 - 20 of 2008". You'll need to scrape that as well and deal with the paginated results. That, too, is left as an exercise for the reader.
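As a starting point for that exercise, the grand total can be pulled out of the indicator text with a small base-R helper (the 20 rows per page matches the table above; how the page number gets passed in the POST body is something you'd need to work out from the form itself):

```r
# Parse the grand total out of an indicator like "1 - 20 of 2008" and work
# out how many pages of results there are at 20 rows per page.
total_pages <- function(indicator, per_page = 20) {
  total <- as.integer(gsub(",", "", sub(".*of\\s*([0-9,]+).*", "\\1", indicator)))
  ceiling(total / per_page)
}

total_pages("1 - 20 of 2008")
# [1] 101
```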