Using R and httr to retrieve data - Okta API GET request with paginated link headers
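I am trying to use the RStudio / Hadley Wickham 'httr' R package to return all records from an Okta API GET request ('List Users Assigned to Application'). The following request works fine for getting the maximum number of records allowed per call (500):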
oktaurl <- "https://mydomain.okta.com/api/v1/apps/applicationID/users?limit=500"
oktagetjson <- with_verbose(content(GET(oktaurl,
add_headers("Authorization" = "bearer myapikey",
"Content-Type" = "application/json;charset=UTF-8"))))
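Parsing the data returned in 'oktagetjson' into a usable data frame with 'jsonlite' and R is not a problem; however, this particular API call is hard-limited to a maximum of 500 records per call, so I need to somehow retrieve and page through all the 'Link:' headers to get all of the several thousand records. The 'Link:' headers themselves take the form: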
Link: <https://mydomain.okta.com/api/v1/apps/applicationID/users?limit=500>; rel="self"
Link: <https://mydomain.okta.com/api/v1/apps/applicationID/users?after=random cursor string&limit=500>; rel="next"
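(The Okta API documentation describes their pagination structure here)
This is where I am stuck:
- When I call 'oktagetjson <- with_verbose(content(GET(oktaurl, etc ... ) ...) ' to get my oktagetjson object, I can see the first two pagination 'Link:' headers listed above in the R / RStudio console, but the 'Link:' headers are not returned as part of the object itself. Calling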
headers(HEAD("https://mydomain.okta.com/api/v1/apps/<applicationID>/users"))
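returns some headers but does not return the pagination 'Link:' headers.
- The 'Link:' headers contain random cursor strings, so I cannot guess their actual format.
- Even if I could retrieve all the required 'Link:' headers, I don't know how to call / iterate / paginate / recurse through them in R to build one object containing the entire data set of several thousand records.
Unfortunately, because of the nature of the request, the service provider and the data, I cannot provide a fully reproducible example with real links and sample data, but I hope the concept is clear enough that someone can point me in the right direction, even if that direction is not to use the 'httr' package or R for this job.
Thanks for your consideration.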
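Hacked something together a while back, but it certainly won't win any elegance awards. I have since modified it to also get the users assigned to Okta applications. Useful if you are auditing or joining against other company / directory data.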
library(jsonlite)
library(dplyr)
library(httr)
library(purrr)
library(stringi)
library(tidyr)
# create character vector to hold URLs we'll use later when we GET content
url_list <- as.character()
# list placeholder for GET content
okta_content <- list()
# initial URL construction parts for first URL
okta_urllimit = as.character("200")
okta_baseurl <- paste0("https://<your company>.okta.com/api/v1/users?limit=",okta_urllimit)
# next URL construction parts for 'next' URLs
basenexturl <- "https://<your company>.okta.com/api/v1/users?after="
baselimiturl <- "&limit=200"
# Pass initial URL to get first batch
okta_get01 <- httr::GET(okta_baseurl,
config = (
add_headers(Authorization = "SSWS <your Okta API key>")))
# append the URL vector
url_list <- append(url_list, okta_baseurl)
# unlist the all_headers list element from the URL
testallheaders <- as.character(unlist(okta_get01$all_headers))
okta_content <- append(okta_content,content(okta_get01))
# if "next" is in the second link URL (testallheaders[16]) then iterate for as long as
# the next URL header element has "next" in it
# (note: index 16 depends on the order of the response headers)
while (grepl("next", testallheaders[16]))
{
# parse the cursor value that follows 'after='
testparsenext <- regmatches(testallheaders[16], gregexpr('(?<=after=).*?(?=&limit)', testallheaders[16], perl = TRUE))[[1]]
# and create URL
oktaurlnext <- paste0(basenexturl,testparsenext,baselimiturl)
# iterate and replace 'okta_baseurl' with each subsequent oktaurlnext
okta_get01 <- httr::GET(oktaurlnext,
config = (
add_headers(Authorization = "SSWS <your Okta API key>")))
testallheaders <- as.character(unlist(okta_get01$all_headers))
url_list <- append(url_list, oktaurlnext)
okta_content <- append(okta_content,content(okta_get01))
next
}
# Parse the results into something usable
oktagettojson <- toJSON(okta_content, simplifyDataFrame = TRUE, flatten = TRUE, recursive = TRUE)
oktagetdf <- fromJSON(oktagettojson, simplifyDataFrame = TRUE, flatten = TRUE)
dfnames <- names(oktagetdf)
oktagetdf <- oktagetdf %>% map_if(is.list, as.character)
oktagetdf <- do.call(cbind, lapply(oktagetdf, data.frame, stringsAsFactors=FALSE))
names(oktagetdf) <- dfnames
# adding columns to separate AD domain mastered account and domain names
oktagetdf <- separate(oktagetdf, profile.login,
into = c("credPrefix", "credSuffix"), sep = "@", remove = FALSE, extra = "drop")
# select some data frame columns of interest
okta_allusers <- subset(oktagetdf, select = c("id","status","created","lastLogin","profile.login","credPrefix", "credSuffix","profile.firstName","profile.lastName","profile.email","credentials.provider.type","credentials.provider.name"))
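The fixed index testallheaders[16] only works while the server sends its headers in the same order. As a rough sketch, the next link could instead be looked up by its rel value. next_link is a hypothetical helper name (not part of httr), and the cursor value in the example is invented:

```r
# Hypothetical helper (not part of httr): return the URL tagged rel="next"
# from one or more Link header values, or NULL when there is none.
next_link <- function(link_values) {
  link_values <- as.character(link_values)
  if (length(link_values) == 0) return(NULL)
  # Several links may be comma-joined into one header value; split them apart.
  entries <- unlist(strsplit(link_values, ",\\s*(?=<)", perl = TRUE))
  hit <- grep('rel="?next"?', entries, value = TRUE)
  if (length(hit) == 0) return(NULL)
  # Keep only the part between the angle brackets.
  sub("^\\s*<([^>]*)>.*$", "\\1", hit[1], perl = TRUE)
}

# Example with the header shapes quoted in the question (cursor value invented).
hdrs <- c(
  '<https://mydomain.okta.com/api/v1/apps/applicationID/users?limit=500>; rel="self"',
  '<https://mydomain.okta.com/api/v1/apps/applicationID/users?after=abc123&limit=500>; rel="next"'
)
next_link(hdrs)
```

Since testallheaders above is just a character vector of every header string, next_link(testallheaders) should slot into the loop in place of the index lookup, and the returned URL can be passed to GET directly instead of being rebuilt from parts.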
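Just spent some time working through this and wanted to share a simpler alternative solution. I couldn't find a good way to regex out the next link without some extra logic, but this can be reused as long as your API returns fully formed links that can be followed.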
library(jsonlite)
library(httr)
library(stringr)
res <- GET(<yourURL>,token)
resDF <- fromJSON(httr::content(res, as = "text"))
while (grepl("next", res$headers$link) == 'TRUE')
{
res <- GET(
ifelse(grepl("prev", res$headers$link) == 'TRUE',
str_match(res$headers$link, "prev, <(.*)>; rel=next")[1,2]
,
str_match(res$headers$link, "first, <(.*)>; rel=next")[1,2]
)
,token)
resDF <- rbind(resDF, fromJSON(httr::content(res, as = "text")))
}
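One thing to watch: when the server sends 'self' and 'next' as two separate Link headers (as in the question), res$headers$link may only give you the first one. Below is a minimal sketch of the same loop that collects every Link header and keeps the pages in a list rather than growing a data frame with rbind on each pass. fetch_page and fetch_all are made-up names, the Authorization value is a placeholder, and I am assuming the body is a JSON array of records:

```r
# Hypothetical fetcher built on httr: one request, returning the parsed records
# plus the rel="next" URL (NULL on the last page). `auth` is a placeholder.
fetch_page <- function(url, auth = "SSWS <your Okta API key>") {
  res <- httr::GET(url, httr::add_headers(Authorization = auth))
  httr::stop_for_status(res)
  h <- httr::headers(res)
  # Gather every Link header, not just the first one
  links <- as.character(unlist(h[tolower(names(h)) == "link"], use.names = FALSE))
  hit <- regmatches(links, regexpr('<[^>]+>(?=;\\s*rel="?next"?)', links, perl = TRUE))
  body <- httr::content(res, as = "text", encoding = "UTF-8")
  list(
    records = jsonlite::fromJSON(body, flatten = TRUE),
    next_url = if (length(hit) > 0) gsub("[<>]", "", hit[1]) else NULL
  )
}

# Follow next links until none is left; max_pages is a safety cap (assumed value).
fetch_all <- function(first_url, fetch = fetch_page, max_pages = 1000L) {
  pages <- list()
  url <- first_url
  while (!is.null(url) && length(pages) < max_pages) {
    page <- fetch(url)
    pages[length(pages) + 1L] <- list(page$records)
    url <- page$next_url
  }
  pages
}
```

Something like pages <- fetch_all("https://mydomain.okta.com/api/v1/apps/applicationID/users?limit=500") would then return one element per page. If every page flattens to the same columns, do.call(rbind, pages) should combine them; otherwise dplyr::bind_rows is more forgiving. The fetch argument is only there so the loop can be tried with a stand-in function before pointing it at the real API.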