How to download and/or extract data stored in a 'raw' binary zip object within a response object in R?

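I can't download or read a zip file from an API request using the httr package. Is there another package I can try that will let me download/read a binary zip file stored in the response of a GET request in R?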

I tried two approaches:

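  1. Use GET to obtain a response object of type application/json (successful), then use fromJSON on content(my_response, 'text') to extract the content. The output includes a column named 'zip', which is the data I'm interested in downloading; the documentation says it is a base64-encoded binary file. This column is currently a very long string of seemingly random letters, and I'm not sure how to convert it into the actual dataset.

  2. Bypass fromJSON, since I noticed that the response object itself has a field of class 'raw'. This object is a list of seemingly random numbers, which I suspect is the binary representation of the dataset. I tried rawToChar(my_response$content) to convert the raw data to character, but this produces the same long string as in #1.

I also noticed that with approach #1, if I run base64_dec() on the long string, I get the same kind of output as the 'raw' field of the response itself.
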
getzip1  <- GET(getzip1_link)
getzip1 # successful response, status 200
df <- fromJSON(content(getzip1, "text"))

df$status # "OK"
df$dataset$zip # <- this is the very long string of letters (eg. "I1NC5qc29uUEsBAhQDFA...")

# Method 1: try to convert from the 'zip' object in the output of fromJSON
try1 <- base64_dec(df$dataset$zip)
#looks similar to getzip1$content (i.e.  this produces the list of numbers/letters 50 4b 03 04 14 00, etc, perhaps binary representation)

# Method 2: try to get data directly from raw object
class(getzip1$content) # <- 'raw' class object directly from GET request
try2 <- rawToChar(getzip1$content) # returns the same output as df$dataset$zip


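I expect to be able to use either the raw 'content' object in my response or the long string in the 'zip' object from the fromJSON output to view the dataset or download it somehow. I don't know how to do this. Please help!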

Welcome!

Based on the API documentation, the response of the getDataset endpoint has the following schema:

Dataset archive including meta information, the dataset itself is base64 encoded to allow for binary ZIP transfers.

{
 "status": "OK",
 "dataset": {
 "state_id": 5,
 "session_id": 1624,
 "session_name": "2019-2020 Regular Session",
 "dataset_hash": "1c7d77fe298a4d30ad763733ab2f8c84",
 "dataset_date": "2018-12-23",
 "dataset_size": 317775,
 "mime": "application\/zip",
 "zip": "MIME 64 Encoded Document"
 }
}
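The bytes `50 4b 03 04` seen in the question are the standard signature at the start of a ZIP file (`50 4b` is "PK" in ASCII), which suggests that base64_dec() was already producing the archive's bytes and they only needed to be written to disk. A minimal sketch of that round trip, using a mock response that holds just the 4-byte signature rather than a real archive (the `mock_body` object is an assumption for illustration):

```r
library(jsonlite)

# The 4-byte ZIP local-file-header signature; NOT a complete archive.
zip_signature <- as.raw(c(0x50, 0x4b, 0x03, 0x04))

# Hypothetical stand-in for the parsed getDataset response.
mock_body <- list(
  status = "OK",
  dataset = list(mime = "application/zip",
                 zip = base64_enc(zip_signature))
)

# base64_dec() turns the long string back into a raw vector of bytes.
decoded <- base64_dec(mock_body$dataset$zip)

class(decoded)           # a raw vector, like getzip1$content
rawToChar(decoded[1:2])  # the ASCII letters of the ZIP signature
```

With a real response, `decoded` would hold the whole archive, so the remaining step is writeBin() followed by unzip().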

We can fetch the data with R using the following code:

library(httr)     # GET, status_code, content
library(jsonlite) # fromJSON, base64_dec
library(stringr)  # str_c; also re-exports the %>% pipe
token <- "" # Your API key
session_id <- 1253L # Obtained from the getDatasetList endpoint
access_key <- "2qAtLbkQiJed9Z0FxyRblu" # Obtained from the getDatasetList endpoint
destfile <- file.path("path", "to", "file.zip") # Modify; the directory must exist
response <- str_c("https://api.legiscan.com/?key=",
                  token,
                  "&op=getDataset&id=",
                  session_id,
                  "&access_key=",
                  access_key) %>%
  GET()
status_code(x = response) == 200 # Good
# Parse the JSON body once; it holds the metadata and the base64-encoded zip
body <- content(x = response,
                as = "text",
                encoding = "UTF-8") %>%
  fromJSON()
# Decode the base64 string to raw bytes and write them to disk as a zip file
body %>%
  getElement(name = "dataset") %>%
  getElement(name = "zip") %>%
  base64_dec() %>%
  writeBin(con = destfile)
unzip(zipfile = destfile) # Extracts into the current working directory

unzip will extract the files, which in this case will look something like

hash.md5 # Can be checked against the metadata
AL/2016-2016_1st_Special_Session/bill/*.json
AL/2016-2016_1st_Special_Session/people/*.json
AL/2016-2016_1st_Special_Session/vote/*.json
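Once the archive is extracted, the JSON files can be read back into R with list.files() and fromJSON(). The sketch below builds a tiny stand-in for the extracted directory so that it runs without an API key; the file name `HB1.json` and the fields `bill_id` and `title` are made up for illustration and may not match the real dataset.

```r
library(jsonlite)

# Build a small mock of the extracted layout (hypothetical file and fields).
root <- tempfile("legiscan_demo")
bill_dir <- file.path(root, "AL", "2016-2016_1st_Special_Session", "bill")
dir.create(bill_dir, recursive = TRUE)
writeLines('{"bill": {"bill_id": 1, "title": "Example"}}',
           file.path(bill_dir, "HB1.json"))

# Find every JSON file under the root, then keep those in a "bill" folder.
json_files <- list.files(root, pattern = "\\.json$",
                         recursive = TRUE, full.names = TRUE)
bill_files <- json_files[basename(dirname(json_files)) == "bill"]

# Parse each file into an R list.
bills <- lapply(bill_files, fromJSON)
length(bills)
bills[[1]]$bill$title
```

The same pattern should apply to the people and vote folders by changing the folder name in the filter.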

As always, wrap your code in functions and profit.
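One possible way to do that wrapping is sketched below. The function names `write_dataset_zip` and `download_dataset` are hypothetical, and splitting the decoding step from the network call is a design choice made here so that the decoding can be tried offline with a mock response. `download_dataset` needs the httr package, a valid API key, and network access, so it is defined but not called.

```r
library(jsonlite)

# Decode the base64 `zip` field of a parsed getDataset response
# and write the resulting bytes to `destfile`.
write_dataset_zip <- function(body, destfile) {
  stopifnot(identical(body$status, "OK"))
  writeBin(base64_dec(body$dataset$zip), con = destfile)
  invisible(destfile)
}

# Fetch a dataset and save it; requires httr and network access.
download_dataset <- function(token, session_id, access_key, destfile) {
  response <- httr::GET(
    "https://api.legiscan.com/",
    query = list(key = token, op = "getDataset",
                 id = session_id, access_key = access_key)
  )
  httr::stop_for_status(response)
  body <- fromJSON(httr::content(response, as = "text", encoding = "UTF-8"))
  write_dataset_zip(body, destfile)
}

# Offline check with a mock response (the payload is not a real archive).
mock_body <- list(status = "OK",
                  dataset = list(zip = base64_enc(charToRaw("PK demo"))))
out <- write_dataset_zip(mock_body, tempfile(fileext = ".zip"))
readBin(out, what = "raw", n = 7)
```

With real credentials, a call such as `download_dataset(token, 1253L, access_key, destfile)` followed by `unzip(destfile)` would mirror the script above.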

PS: Here is what the code would look like in Julia, for comparison.

using Base64, HTTP, JSON3
token = "" # Your API key
session_id = 1253 # Obtained from the getDatasetList endpoint
access_key = "2qAtLbkQiJed9Z0FxyRblu" # Obtained from the getDatasetList endpoint
destfile = joinpath("path", "to", "file.zip") # Modify
response = string("https://api.legiscan.com/?",
                  join(["key=$token",
                        "op=getDataset",
                        "id=$session_id",
                        "access_key=$access_key"],
                        "&")) |>
    HTTP.get
@assert response.status == 200
JSON3.read(response.body) |>
    (content -> content.dataset.zip) |>
    base64decode |>
    (data -> write(destfile, data))
run(`unzip $destfile`) # Pass the file as an argument; pipeline(`unzip`, destfile) would redirect stdout into it