Download CSV file from results page with options from dropdown menu

Question

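I am new to web scraping with R and I am stuck on this problem: I want to use R to submit a search query to PubMed and then download a CSV file from the results page. The CSV file is reached by clicking 'Send to', which opens a dropdown menu. I then need to select the 'File' radio button, change the 'Format' option to 'CSV' (option 6), and finally click the 'Create File' button to start the download.

A few notes:
1. Yes, this kind of remote search and download complies with NCBI's policies.
2. Why not use the easyPubMed package? I have already tried it and am using it for another part of my work. However, retrieving search results with this package misses some of the article metadata that the CSV download includes.

I have looked at several related questions.

I feel that the earlier solutions provided by @hrbrmstr contain the answer, but I just can't put the pieces together to download the CSV file.

I think an elegant solution to this problem is a two-step process: 1) POST a search request to PubMed and GET the results, and 2) submit a second POST request to the results page (or somehow navigate within it) with the desired options selected to download the CSV file. I have tried the following with a toy search query ("hello world", with quotes, which currently returns 6 results)...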

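library(httr)   # POST(), content(), write_disk()
library(rvest)  # html_session(), html_form(), html_nodes(); also provides %>%

query <- '"hello world"'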
url <- 'https://www.ncbi.nlm.nih.gov/pubmed/'

html_form(html_session(url)) # enter query using 'term'
# post search and retrieve results
session <- POST(url,body = list(term=query),encode='form')

# scrape results to check that above worked
content(session) %>% html_nodes('#maincontent > div > div:nth-child(5)') %>% 
  html_text()
content(session) %>% html_nodes('#maincontent > div > div:nth-child(5)') %>% 
  html_nodes('p') %>% html_text()

# view html nodes of dropdown menu -- how to 'click' these via R?
content(session) %>% html_nodes('#sendto > a')
content(session) %>% html_nodes('#send_to_menu > fieldset > ul > li:nth-child(1) > label')
content(session) %>% html_nodes('#file_format')
content(session) %>% html_nodes('#submenu_File > button')

# submit request to download CSV file
POST(session$url, # I know this doesn't work, but I would hope something similar is possible
     encode='form',
     body=list('EntrezSystem2.PEntrez.PubMed.Pubmed_ResultsPanel.Pubmed_DisplayBar.SendTo'='File',
               'EntrezSystem2.PEntrez.PubMed.Pubmed_ResultsPanel.Pubmed_DisplayBar.FFormat'=6,
               'EntrezSystem2.PEntrez.PubMed.Pubmed_ResultsPanel.Pubmed_DisplayBar.SendToSubmit'=1),
     write_disk('results.csv'))

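The last line above fails: a CSV file is downloaded, but it contains the HTML results from the POST request. Ideally, how can I edit the last line to get the desired CSV file?

***One possible hack is to jump straight to the results page. In other words, I know that submitting the "hello world" search returns the following URL: https://www.ncbi.nlm.nih.gov/pubmed/?term=%22hello+world%22. So, if necessary, I can extrapolate from here and build the results URLs from my search query.

I have tried plugging this URL into the line above, but it still does not return the desired CSV file. I can view the form fields using the command below...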

# view form options on the results page
html_form(html_session('https://www.ncbi.nlm.nih.gov/pubmed/?term=%22hello+world%22'))

Alternatively, can I extend the URL, given that I know the form options above? Something like...

url2 <- 'https://www.ncbi.nlm.nih.gov/pubmed/?term=%22hello+world%22&EntrezSystem2.PEntrez.PubMed.Pubmed_ResultsPanel.Pubmed_DisplayBar.SendTo=File&EntrezSystem2.PEntrez.PubMed.Pubmed_ResultsPanel.Pubmed_DisplayBar.FFormat=6&EntrezSystem2.PEntrez.PubMed.Pubmed_ResultsPanel.Pubmed_DisplayBar.SendToSubmit=1'
POST(url2,write_disk('results2.csv'))

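I expected to download a CSV file containing the 6 results with their article metadata. Instead, I get the HTML of the results page.

Any help is much appreciated! Thanks.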

Answer 1

If I reframe your question as: "I want to use R to submit a search query to PubMed and then download information that is the same as what is provided in the CSV download option on the results page."

Then I think you can skip the scraping and web UI automation and go straight to the API that NIH has provided for this purpose.

The first part of this R code performs the same search ("hello world") and gets the same results in JSON format (feel free to paste the search_url link into a browser to verify).

library(httr)
library(jsonlite)
library(tidyverse)

# Search for "hello world"
search_url <- "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=%22hello+world%22&format=json"

# Search for results
search_result <- GET(search_url)

# Extract the content
search_content <- content(search_result, 
                          type = "application/json",
                          simplifyVector = TRUE)

# search_content$esearchresult$idlist
# [1] "29725961" "28103545" "27567633" "25955529" "22999052" "19674957"

# Get a vector of the search result IDs
result_ids <- search_content$esearchresult$idlist

# Get a summary for id 29725961 (the first one).
summary_url <- "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=pubmed&version=2.0&id=29725961&format=json"

summary_result <- GET(summary_url)

# Extract the content
summary_content <- content(summary_result, 
                          type = "application/json")

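Presumably you can take it from here, since the list summary_content contains the information you need, just in a different format (I verified this by visual inspection).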
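To run the same two requests for an arbitrary query rather than the hard-coded "hello world" URLs, the URLs could be built from their parts. This is only a minimal sketch: build_esearch_url() and build_esummary_url() are hypothetical helper names, not part of any package, and the retmax parameter (the maximum number of IDs esearch returns) is an addition that the code above does not use.

```r
# Hypothetical helpers (not from any package): build E-utilities URLs
# for an arbitrary PubMed query and for a vector of result IDs.
eutils_base <- "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"

build_esearch_url <- function(term, retmax = 20) {
  paste0(
    eutils_base, "esearch.fcgi?db=pubmed",
    # Percent-encode the query, e.g. quotes become %22 and spaces %20
    "&term=", utils::URLencode(term, reserved = TRUE),
    # sprintf("%d") avoids scientific notation for large values
    "&retmax=", sprintf("%d", as.integer(retmax)),
    "&format=json"
  )
}

build_esummary_url <- function(ids) {
  paste0(
    eutils_base, "esummary.fcgi?db=pubmed&version=2.0",
    # Several IDs can be requested at once as a comma-separated list
    "&id=", paste(ids, collapse = ","),
    "&format=json"
  )
}
```

The returned strings can be passed to GET() exactly like search_url and summary_url above, for example GET(build_esummary_url(result_ids)) to request summaries for all search results in one call.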

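However, in keeping with the spirit of your original question (give me a CSV, using R, by pulling from NCBI), here are some steps you could use to reproduce the same CSV that a human can get from the PubMed web UI.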

# Quickie cleanup (thanks to Tony ElHabr)
# https://www.r-bloggers.com/converting-nested-json-to-a-tidy-data-frame-with-r/
summary_untidy <- enframe(unlist(summary_content))

# Get rid of *some* of the fluff...
summary_tidy <- summary_untidy %>% 
  filter(grepl("result.29725961", name)) %>% 
  mutate(name = sub("result.29725961.", "", name))

# Convert the multiple author records into a single comma-separated string.
authors <- summary_tidy %>% 
  filter(grepl("^authors.name$", name)) %>% 
  summarize(pasted = paste(value, collapse = ", "))

# Begin to construct a data frame that has the same information as the downloadable CSV
summary_csv <- tibble(
  Title = summary_tidy %>% filter(name == "title") %>% pull(value),
  URL = sprintf("/pubmed/%s", summary_tidy %>% filter(name == "uid") %>% pull(value)),
  Description = pull(authors, pasted),
  Details = "... and so on, and so on, and so on... "
)

# Write the sample data frame to a csv.
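# The file name is passed positionally because newer readr versions
# deprecate the `path` argument in favor of `file`.
write_csv(summary_csv, "just_like_the_search_page_csv.csv")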
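
The steps above handle a single ID. A rough sketch of the same idea for every article in one esummary response might look like the following. It assumes the parsed JSON has a result element holding a uids list plus one entry per uid, each with title and authors fields, which matches what the code above relies on. summaries_to_table() is a hypothetical helper name, and only the first three CSV columns are filled in.

```r
# Hypothetical helper: turn a parsed esummary response (as returned by
# content(..., type = "application/json")) into one row per article.
summaries_to_table <- function(summary_content) {
  result <- summary_content$result
  uids <- unlist(result$uids)

  rows <- lapply(uids, function(uid) {
    rec <- result[[uid]]
    # Collapse the author records into a single comma-separated string;
    # an article without authors yields an empty string.
    author_names <- vapply(rec$authors, function(a) a$name, character(1))
    data.frame(
      Title = rec$title,
      URL = sprintf("/pubmed/%s", uid),
      Description = paste(author_names, collapse = ", "),
      stringsAsFactors = FALSE
    )
  })

  # Stack the one-row data frames into a single table
  do.call(rbind, rows)
}
```

The resulting data frame can then be written out with write_csv() or write.csv() as before.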

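I'm not familiar with the easyPubMed package you mentioned, but digging through the easyPubMed code is what inspired me to use the NCBI API. It is entirely possible that you could fix or adapt some of the easyPubMed code to extract the extra metadata you were hoping to get from pulling a bunch of CSVs. (There isn't much there: only about 500 lines of code defining 8 functions.)

Heck, if you do manage to adapt the easyPubMed code to pull the extra metadata, I'd suggest contributing your changes back to the authors so they can improve their package!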

Answer 2

Using the easyPubMed package:

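library(easyPubMed)
# Download the search results to local files; returns the file names
out <- batch_pubmed_download(pubmed_query_string = "hello world")
# Convert the first downloaded file to a data frame of article records
DF <- table_articles_byAuth(pubmed_data = out[1])
write.csv(DF, "helloworld.csv")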

See the vignette and help files in easyPubMed for details.

Other packages are pubmed.mineR, rentrez and RISmed on CRAN, annotate on Bioconductor, and Rcupcake on GitHub.

html

r

web-scraping

httr

rvest