How to download PDF of academic papers via Google Scholar query using R or Python
I have a list of academic paper titles that I need to download. I would like to write a loop that downloads their PDF files from the web, but I cannot figure out how. Here is my current idea, step by step (answers in either R or Python are welcome):
# Create list with paper titles (example with 4 papers from different journals)
titles <- c("Effect of interfacial properties on polymer–nanocrystal thermoelectric transport",
"Reducing social and environmental impacts of urban freight transport: A review of some major cities",
"Using Lorenz curves to assess public transport equity",
"Green infrastructure: The effects of urban rail transit on air quality")
# Loop step 1 - Query the paper title in Google Scholar to get the URL of the journal page hosting the paper
# Loop step 2 - Download the PDF from the journal page and save it to your computer
for (i in titles){
  journal_URL <- query_google_scholar(i)  # pseudocode: first Scholar hit for title i
  download.file(url = journal_URL,
                destfile = paste0(i, ".pdf"),
                mode = "wb")  # "wb" so the binary PDF is not corrupted on Windows
}
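One pitfall with `destfile = paste0(i, ".pdf")`: several of these titles contain colons, which are illegal in file names on Windows, and spaces that are awkward elsewhere. A minimal Python sketch of a title-to-filename helper (`safe_filename` is a hypothetical name, not a library function):

```python
import re

def safe_filename(title, ext=".pdf"):
    """Turn a paper title into a file name safe on common filesystems."""
    slug = re.sub(r'[<>:"/\\|?*]', "", title)  # drop characters forbidden on Windows
    slug = re.sub(r"\s+", "_", slug.strip())   # collapse whitespace to underscores
    return slug[:150] + ext                    # keep the path reasonably short

print(safe_filename("Using Lorenz curves to assess public transport equity"))
# Using_Lorenz_curves_to_assess_public_transport_equity.pdf
```

The same idea is easy to port to R with `gsub()` before calling `download.file()`.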
Complications:
Loop step 1 - The first hit on Google Scholar should be the URL of the original paper. However, I hear Google Scholar is picky about bots, so an alternative is to query plain Google and take the first URL, hoping it leads to the right page.
Loop step 2 - Some papers are gated, so I suppose it will be necessary to include authentication details (user = __, passwd = __). However, if I am on my university's network, this authentication should happen automatically, right?
ps. I only need to download the PDFs. I am not interested in retrieving bibliometric information (e.g., citation records, h-index). For bibliometric data there is some guidance here (R users) and here (Python users).
Crossref has a program through which publishers can supply metadata containing links to the full-text versions of articles. Unfortunately, publishers like Wiley, Elsevier, and Springer may provide the links, but you need additional permissions to actually retrieve the content. Fun, right? Anyway, some of it works. For example, the following works for your second title: search Crossref, get the full-text URL (if provided), then fetch the XML (better than PDF, IMHO):
titles <- c("Effect of interfacial properties on polymer–nanocrystal thermoelectric transport",
            "Reducing social and environmental impacts of urban freight transport: A review of some major cities",
            "Using Lorenz curves to assess public transport equity",
            "Green infrastructure: The effects of urban rail transit on air quality")
library("rcrossref")
out <- cr_search(titles[2])                       # search Crossref for the title
doi <- sub("http://dx.doi.org/", "", out$doi[1])  # strip the resolver prefix to get the bare DOI
(links <- cr_ft_links(doi, "all"))                # full-text links, if the publisher supplies any
$xml
<url> http://api.elsevier.com/content/article/PII:S1877042812005551?httpAccept=text/xml
$plain
<url> http://api.elsevier.com/content/article/PII:S1877042812005551?httpAccept=text/plain
xml <- cr_ft_text(links, "xml")      # fetch the full-text XML
library("XML")
xpathApply(xml, "//ce:author")[[1]]  # pull out the first author node
<ce:author>
<ce:degrees>Prof</ce:degrees>
<ce:given-name>Eiichi</ce:given-name>
<ce:surname>Taniguchi</ce:surname>
</ce:author>
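For anyone doing that last extraction step in Python instead: Elsevier's full-text XML uses the `ce` prefix, which must be bound to a namespace before XPath-style lookups work. An offline sketch with the standard library (the snippet is a hand-made stand-in for the real API response, and the namespace URI is assumed to be Elsevier's common DTD):

```python
import xml.etree.ElementTree as ET

# Minimal stand-in for the full-text XML returned by the Elsevier API.
snippet = """<doc xmlns:ce="http://www.elsevier.com/xml/common/dtd">
  <ce:author>
    <ce:degrees>Prof</ce:degrees>
    <ce:given-name>Eiichi</ce:given-name>
    <ce:surname>Taniguchi</ce:surname>
  </ce:author>
</doc>"""

ns = {"ce": "http://www.elsevier.com/xml/common/dtd"}
root = ET.fromstring(snippet)
author = root.find("ce:author", ns)  # namespace map makes the ce: prefix resolvable
name = f'{author.find("ce:given-name", ns).text} {author.find("ce:surname", ns).text}'
print(name)  # Eiichi Taniguchi
```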