How do I properly pass a quoted string as part of the URL input to httr::GET()?

Question

Suppose I want to pass a URL like this to httr::GET():

https://www.uniprot.org/uniprot/?query=name%3A"dna+methyltransferase"

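How do I correctly pass the quoted part of this string (i.e. "dna+methyltransferase") as input? My input URL string is stored as shown below, and passing it directly does not work because the escaped double quotes are not evaluated: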

> urlinp <- "https://www.uniprot.org/uniprot/?query=name%3A\"dna+methyltransferase\""
> status_code(GET(urlinp))
# [1] 400

One idea I had was to use capture.output() and cat() to pass the (parsed) string, but that does not work either:

> status_code(GET(capture.output(cat(urlinp))))
# [1] 400

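Frankly, I don't know how to do this. Googling hasn't really helped (or I'm searching with the wrong terms). Any pointers would be much appreciated.

Edit: updated with context below.

So, I basically have a small function that takes two strings, SoughtProtein and SoughtTaxon, as input and builds a URL query (?) from them, as shown below.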

UniProtQueryConstructor <- function(SoughtProtein = NULL, SoughtTaxon = NULL){

  #Function constants
  tmpUniProtBaseURL <- "https://www.uniprot.org/uniprot/"
  tmpUniProtURLRetFormat <- "&format=tab"

  #Formatting steps below
  if(!is.null(SoughtProtein)){


    #If the protein name has more than one word (e.g., "DNA methyltransferase"), enclose that string in double quotes

    if(stringr::str_detect(SoughtProtein, "\\s")){

      #Lowercasing the string, and replacing punctuation and whitespace with "+"
      innertmpProtName <- stringr::str_replace_all(paste0(tolower(SoughtProtein)), stringr::regex("[[:punct:]\\s]+"), "+")

      #Enclosing the multi-word string in double quotes
      innertmpProtName <- paste0('\"', innertmpProtName, '\"')

      #Writing it to a temporary variable that will be passed on for final URL assembly
      tmpProtName <- paste0("name%3A", innertmpProtName)

    } else{

      #Else condition is a simple case, since there is no multi-word string to be dealt with

      tmpProtName <- paste0("name%3A", stringr::str_replace_all(paste0(tolower(SoughtProtein)), stringr::regex("[[:punct:]\\s]+"), "+"))

    }

  } else{ 

    #Else assign an empty string to the protein name if there is no user input

    tmpProtName <- ""

  }

  #Input string prep for taxon selection
  if(!is.null(SoughtTaxon)){

    tmpTaxon <- paste0("taxonomy%3A", stringr::str_replace_all(paste0(tolower(SoughtTaxon)), stringr::regex("[[:punct:]\\s]+"), "+"))

  } else{

    tmpTaxon <- ""

  }


  #Combining user inputs into one character vector
  tmpInpTermList <- c(tmpProtName, tmpTaxon)


  #Preparing query string
  tmpAssembledUniProtQuery <- paste0("?query=", paste(tmpInpTermList[which(nchar(tmpInpTermList) > 0)], sep = "", collapse = "+AND+"))


  #Full query URL
  tmpFullUniProtSearchURL <- paste0(tmpUniProtBaseURL, tmpAssembledUniProtQuery, tmpUniProtURLRetFormat)

  return(tmpFullUniProtSearchURL)
}

#Test case below

TestSearch <- UniProtQueryConstructor(SoughtProtein = "DNA methyltransferase", SoughtTaxon = "Eukaryota")

#Double quotes within the string not dealt with properly.
TestSearch

# [1] "https://www.uniprot.org/uniprot/?query=name%3A\"dna+methyltransferase\"+AND+taxonomy%3Aeukaryota&format=tab"

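The problem is that this function needs to handle input strings containing multiple words separated by a space (e.g. "DNA methyltransferase") by enclosing them in double quotes in the query string, like so: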

query=name%3A"dna+methyltransferase"

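This is where I run into my problem, because I cannot get the escaped double quotes to display correctly (as the sample output shows).

I wrote this update just as several answers suggesting URLencode() came in. I think the proposed solutions solve the problem at hand (parsing the string correctly) and also ease the overall problem somewhat (I'm bad at writing code; I learned something new today!).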

Answer 1

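Let's do a few things to handle this:

As @camille correctly pointed out, it is easier to wrap the string in single quotes.
While we are at it, let's replace the "%3A" in the URL template with the colon it stands for.
Now, let's use URLencode. It takes care of the quotes and any other characters that need escaping (a bare colon is a reserved character, which URLencode leaves as-is by default).

Then we are done.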

library(httr)
# Correct sample format for URL
# https://www.uniprot.org/uniprot/?query=name%3A%22dna+methyltransferase%22&sort=score
query_url <- 'https://www.uniprot.org/uniprot/?query=name:"dna+methyltransferase"' 
encoded_url <- URLencode(query_url)
resp <- httr::GET(encoded_url)
status_code(resp)
#> [1] 200

Created on 2019-11-23 by the reprex package (v0.3.0)
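
To see what the string in the question actually contains, it may help to inspect it without sending any request. The following is a minimal sketch using base R only; the variable names single_quoted and encoded_url are illustrative, and nothing here contacts the UniProt server:

```r
# The backslashes shown by print() are only R's display escaping;
# the stored string contains plain double-quote characters.
urlinp <- "https://www.uniprot.org/uniprot/?query=name:\"dna+methyltransferase\""
single_quoted <- 'https://www.uniprot.org/uniprot/?query=name:"dna+methyltransferase"'

# Both literals produce exactly the same string.
identical(urlinp, single_quoted)

# cat() writes the raw characters, so no backslashes appear.
cat(urlinp, "\n")

# Check whether a literal backslash is stored anywhere in the string.
grepl("\\", urlinp, fixed = TRUE)

# URLencode() percent-encodes the quotes; reserved characters such as
# ':' and '+' are left alone with the default reserved = FALSE.
encoded_url <- utils::URLencode(urlinp)
encoded_url
```

So the 400 response is more likely caused by the unencoded quote characters in the request than by the backslashes, which exist only in the printed representation.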

Answer 2

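I tried to find a post that already covers this, but a few details here tripped me up. You can encode the URL with utils::URLencode, so that the quotes are replaced with their percent-encoded equivalents.

URLencode has an argument repeated, which defaults to FALSE: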

repeated—logical: should apparently already-encoded URLs be encoded again?

An ‘apparently already-encoded URL’ is one containing %xx for two hexadecimal digits.

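Your URL already has a piece encoded as %3A, the encoded version of :. Because an encoded substring is already present, no further encoding happens by default. Instead, set repeated = TRUE, and the quotes get encoded as well: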

library(httr)

urlinp <- 'https://www.uniprot.org/uniprot/?query=name%3A"dna+methyltransferase"'

URLencode(urlinp, repeated = FALSE)
#> [1] "https://www.uniprot.org/uniprot/?query=name%3A\"dna+methyltransferase\""
URLencode(urlinp, repeated = TRUE)
#> [1] "https://www.uniprot.org/uniprot/?query=name%253A%22dna+methyltransferase%22"

status_code(GET(URLencode(urlinp, repeated = TRUE)))
#> [1] 200
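
Note that with repeated = TRUE the existing %3A is itself re-encoded to %253A, as the output above shows. If that double encoding is a concern, one option is to decode first and then encode exactly once. This is a minimal sketch with base R only; the names looks_encoded, decoded_url and encoded_once are illustrative, and whether the server treats the two forms identically is not verified here:

```r
urlinp <- 'https://www.uniprot.org/uniprot/?query=name%3A"dna+methyltransferase"'

# With repeated = FALSE, URLencode() returns its input unchanged when the
# string already contains a %xx sequence; this mirrors that check.
looks_encoded <- grepl("%[[:xdigit:]]{2}", urlinp)

# Undo the existing encoding first: %3A becomes ':' ('+' is left untouched).
decoded_url <- utils::URLdecode(urlinp)

# Then encode once: the quotes become %22, while ':' and '+' stay as they are.
encoded_once <- utils::URLencode(decoded_url)
encoded_once
```

The result can then be passed to GET() in the same way as the repeated = TRUE version.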

r

text-parsing

httr