使用 R 接受 cookie 下载 PDF 文件

Using R to accept cookies to download a PDF file

我在尝试下载 PDF 时卡在了 cookie 上。

例如,如果我有一个 DOI for a PDF document on the Archaeology Data Service, it will resolve to this landing pageembedded link in it to this pdf but which really redirects to this 其他 link.

library(httr) 将处理解析 DOI,我们可以使用 library(XML) 从着陆页提取 pdf URL,但我坚持获取 PDF 本身。

如果我这样做:

download.file("http://archaeologydataservice.ac.uk/archiveDS/archiveDownload?t=arch-1352-1/dissemination/pdf/Dyfed/GL44004.pdf", destfile = "tmp.pdf")

然后我收到一个与 http://archaeologydataservice.ac.uk/myads/

相同的 HTML 文件

尝试 How to use R to download a zipped file from a SSL page that requires cookies 的答案让我想到了这个:

library(httr)

terms <- "http://archaeologydataservice.ac.uk/myads/copyrights"
download <- "http://archaeologydataservice.ac.uk/archiveDS/archiveDownload"
values <- list(agree = "yes", t = "arch-1352-1/dissemination/pdf/Dyfed/GL44004.pdf")

# Accept the terms on the form,
# generating the appropriate cookies

POST(terms, body = values)
GET(download, query = values)

# Actually download the file (this will take a while)

resp <- GET(download, query = values)

# write the content of the download to a binary file

writeBin(content(resp, "raw"), "c:/temp/thefile.zip")

但是在 POSTGET 函数之后,我只得到 HTML 我用 download.file:

得到的同一个 cookie 页面
> GET(download, query = values)
Response [http://archaeologydataservice.ac.uk/myads/copyrights?from=2f6172636869766544532f61726368697665446f776e6c6f61643f61677265653d79657326743d617263682d313335322d3125324664697373656d696e6174696f6e2532467064662532464479666564253246474c34343030342e706466]
  Date: 2016-01-06 00:35
  Status: 200
  Content-Type: text/html;charset=UTF-8
  Size: 21 kB
<?xml version='1.0' encoding='UTF-8' ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "h...
<html xmlns="http://www.w3.org/1999/xhtml" lang="en">
        <head>
            <meta http-equiv="Content-Type" content="text/html; c...


            <title>Archaeology Data Service:  myADS</title>

            <link href="http://archaeologydataservice.ac.uk/css/u...
...

正在查看http://archaeologydataservice.ac.uk/about/Cookies it seems that the cookie situation at this site is complicated. Seems like this kind of cookie complexity is not unusual for UK data providers: automating the login to the uk data service website in R with RCurl or httr

我如何使用 R 来绕过这个网站上的 cookies?

您对 rOpenSci 的请求已被听到!

这些页面之间有很多 javascript,这使得尝试通过 httr + rvest 进行破译有些烦人。尝试 RSelenium。这适用于 OS X 10.11.2、R 3.2.3 和加载的 Firefox。

library(RSelenium)

# check if a sever is present, if not, get a server
checkForServer()

# get the server going
startServer()

dir.create("~/justcreateddir")
setwd("~/justcreateddir")

# we need PDFs to download instead of display in-browser
prefs <- makeFirefoxProfile(list(
  `browser.download.folderList` = as.integer(2),
  `browser.download.dir` = getwd(),
  `pdfjs.disabled` = TRUE,
  `plugin.scan.plid.all` = FALSE,
  `plugin.scan.Acrobat` = "99.0",
  `browser.helperApps.neverAsk.saveToDisk` = 'application/pdf'
))
# get a browser going
dr <- remoteDriver$new(extraCapabilities=prefs)
dr$open()

# go to the page with the PDF
dr$navigate("http://archaeologydataservice.ac.uk/archives/view/greylit/details.cfm?id=17755")

# find the PDF link and "hit ENTER"
pdf_elem <- dr$findElement(using="css selector", "a.dlb3")
pdf_elem$sendKeysToElement(list("\uE007"))

# find the ACCEPT button and "hit ENTER"
# that will save the PDF to the default downloads directory
accept_elem <- dr$findElement(using="css selector", "a[id$='agreeButton']")
accept_elem$sendKeysToElement(list("\uE007"))

现在等待下载完成。 R 控制台在下载时不会忙,因此很容易在下载完成之前意外关闭会话。

# close the session
dr$close()

这个答案来自 John Harrison 的电子邮件,应他的要求张贴在这里:

这将允许您下载 PDF:

appURL <- "http://archaeologydataservice.ac.uk/archiveDS/archiveDownload?t=arch-1352-1/dissemination/pdf/Dyfed/GL44004.pdf"
library(RCurl)
library(XML)
curl = getCurlHandle()
curlSetOpt(cookiefile="cookies.txt"
           , curl=curl, followLocation = TRUE)
pdfData <- getBinaryURL(appURL, curl = curl, .opts = list(cookie = "ADSCOPYRIGHT=YES"))
writeBin(pdfData, "test2.pdf")

这是展示他作品的更长版本

appURL <- "http://archaeologydataservice.ac.uk/archiveDS/archiveDownload?t=arch-1352-1/dissemination/pdf/Dyfed/GL44004.pdf"
library(RCurl)
library(XML)
curl = getCurlHandle()
curlSetOpt(cookiefile="cookies.txt"
           , curl=curl, followLocation = TRUE)
appData <- getURL(appURL, curl = curl)

# get the necessary elements for the POST that is initiated when the ACCEPT button is pressed

doc <- htmlParse(appData)
appAttrs <- doc["//input", fun = xmlAttrs]
postData <- lapply(appAttrs, function(x){data.frame(name = x[["name"]], value = x[["value"]]
                                                    , stringsAsFactors = FALSE)})
postData <- do.call(rbind, postData)

# post your acceptance
postURL <- "http://archaeologydataservice.ac.uk/myads/copyrights.jsf;jsessionid="
# get jsessionid
jsessionid <- unlist(strsplit(getCurlInfo(curl)$cookielist[1], "\t"))[7]

searchData <- postForm(paste0(postURL, jsessionid), curl = curl,
                       "j_id10" = "j_id10",
                       from = postData[postData$name == "from", "value"],
                       "javax.faces.ViewState" = postData[postData$name == "javax.faces.ViewState", "value"],
                       "j_id10:_idcl" = "j_id10:agreeButton"
                       , binary = TRUE
)
con <- file("test.pdf", open = "wb")
writeBin(searchData, con)
close(con)


Pressing the ACCEPT button on the page you gave initiates a POST to "http://archaeologydataservice.ac.uk/myads/copyrights.jsf;jsessionid=......" via some javascript.
This post then redirects to the page with the pdf having given some additional cookies.

Checking our cookies we see:

> getCurlInfo(curl)$cookielist
[1] "archaeologydataservice.ac.uk\tFALSE\t/\tFALSE\t0\tJSESSIONID\t3d249e3d7c98ec35998e69e15d3e" 
[2] "archaeologydataservice.ac.uk\tFALSE\t/\tFALSE\t0\tSSOSESSIONID\t3d249e3d7c98ec35998e69e15d3e"
[3] "archaeologydataservice.ac.uk\tFALSE\t/\tFALSE\t0\tADSCOPYRIGHT\tYES"          

so it would probably be sufficient to set that last cookie to start with (indicating we accept copyright)