URL encoding issues with curly braces
I'm having trouble fetching data from GitHub Archive.
The main problem is how {} and .. are encoded in the URL. Maybe I'm misreading the GitHub Archive API, or I don't understand the encoding correctly.
require 'open-uri'
require 'faraday'

conn = Faraday.new(:url => 'http://data.githubarchive.org/') do |faraday|
  faraday.request  :url_encoded             # form-encode POST params
  faraday.response :logger                  # log requests to STDOUT
  faraday.adapter  Faraday.default_adapter  # make requests with Net::HTTP
end

#query = '2015-01-01-15.json.gz'      # this one works!!
query = '2015-01-01-{0..23}.json.gz'  # this one doesn't work
encoded_query = URI.encode(query)
response = conn.get(encoded_query)
p response.body
The GitHub Archive example for retrieving a range of files is:
wget http://data.githubarchive.org/2015-01-01-{0..23}.json.gz
The {0..23} part is expanded by the shell into the range 0 through 23 before wget runs. You can see what it does by executing the command with the -v flag:
wget -v http://data.githubarchive.org/2015-01-01-{0..1}.json.gz
--2015-06-11 13:31:07-- http://data.githubarchive.org/2015-01-01-0.json.gz
Resolving data.githubarchive.org... 74.125.25.128, 2607:f8b0:400e:c03::80
Connecting to data.githubarchive.org|74.125.25.128|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2615399 (2.5M) [application/x-gzip]
Saving to: '2015-01-01-0.json.gz'
2015-01-01-0.json.gz   100%[=====================>]   2.49M  3.03MB/s   in 0.8s
2015-06-11 13:31:09 (3.03 MB/s) - '2015-01-01-0.json.gz' saved [2615399/2615399]
--2015-06-11 13:31:09-- http://data.githubarchive.org/2015-01-01-1.json.gz
Reusing existing connection to data.githubarchive.org:80.
HTTP request sent, awaiting response... 200 OK
Length: 2535599 (2.4M) [application/x-gzip]
Saving to: '2015-01-01-1.json.gz'
2015-01-01-1.json.gz   100%[=====================>]   2.42M   867KB/s   in 2.9s
2015-06-11 13:31:11 (867 KB/s) - '2015-01-01-1.json.gz' saved [2535599/2535599]
FINISHED --2015-06-11 13:31:11--
Total wall clock time: 4.3s
Downloaded: 2 files, 4.9M in 3.7s (1.33 MB/s)
In other words, the shell substitutes each value into the URL, and wget then fetches each resulting URL. This behavior isn't obvious, nor is it well documented, but you can find mentions of it "out there", for example in "All the Wget Commands You Should Know":
7. Download a list of sequentially numbered files from a server
wget http://example.com/images/{1..20}.jpg
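Ruby has no brace expansion, but the list of URLs that {1..20} produces can be built with a Range and string interpolation; a minimal sketch:

```ruby
# Build the same URLs that bash's {1..20} would expand to
urls = (1..20).map { |i| "http://example.com/images/#{i}.jpg" }
urls.first # => "http://example.com/images/1.jpg"
urls.last  # => "http://example.com/images/20.jpg"
```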
To do what you want, you need to iterate over the range yourself in Ruby, using something like this untested code:
0.upto(23) do |i|
  response = conn.get("/2015-01-01-#{i}.json.gz")
  p response.body
end
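If you also want to keep the files, the loop can write each response body to disk. A sketch, untested against the live service; fetch_archives and its parameters are made up for illustration, and conn is assumed to behave like the Faraday connection above (it responds to get and returns an object with a body):

```ruby
require 'fileutils'

# Fetch all 24 hourly archives for a date and save the raw gzip bytes.
def fetch_archives(conn, date = '2015-01-01', dir = '.')
  FileUtils.mkdir_p(dir)
  (0..23).map do |hour|
    name = "#{date}-#{hour}.json.gz"
    response = conn.get("/#{name}")
    path = File.join(dir, name)
    File.binwrite(path, response.body)  # write the body unmodified
    path
  end
end
```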
To get a better picture of where the problem lies, let's start with the example given in the GitHub Archive documentation:
wget http://data.githubarchive.org/2015-01-01-{0..23}.json.gz
The thing to note here is that {0..23} is automatically expanded by bash. You can see this by running the following command:
echo {0..23}
> 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
This means wget isn't called just once, but 24 times in total. The problem you're running into is that Ruby doesn't automatically expand {0..23} the way bash does; instead, it requests http://data.githubarchive.org/2015-01-01-{0..23}.json.gz verbatim, which doesn't exist.
Instead, you have to loop over 0..23 yourself and make one call per file:
(0..23).each do |n|
  query = "2015-01-01-#{n}.json.gz"
  encoded_query = URI.encode(query)  # note: URI.encode is deprecated (removed in Ruby 3); these filenames need no escaping anyway
  response = conn.get(encoded_query)
  p response.body
end
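Once a response comes back, keep in mind the body is gzip-compressed (newline-delimited JSON). It can be decompressed in memory with Ruby's standard Zlib library; a minimal sketch:

```ruby
require 'zlib'
require 'stringio'

# Decompress a gzip byte string (e.g. response.body) in memory
def decompress_gzip(bytes)
  Zlib::GzipReader.new(StringIO.new(bytes)).read
end
```

Each line of the decompressed string is then a single JSON event that can be handed to JSON.parse.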