OpenURI having issues when passed URLs with fragment identifiers

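I need to read a list of URLs from a text file, then retrieve each page and output a list of its links.

The code fails whenever an input URL contains a fragment identifier (#). I tried escaping these as %23, but that doesn't seem to help.

The error comes from OpenURI and is a 404.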

# requirements
require 'nokogiri'
require 'open-uri'

# process each line of the input text file
line_num = 0
text = File.open('input.txt').read
text.gsub!(/\r\n?/, "\n")
text.each_line do |line|
  print "#{line_num += 1} #{line}"
  open('output.txt', 'a') { |f|
    f.puts "#{line_num} #{line}"
  }
  uri = URI.parse(URI.encode(line.strip))
  page = Nokogiri::HTML(open(uri))
  links = page.css("div.product-carousel-container a")
  # loop through links if present
  e = 0
  while e < links.length
    open('output.txt', 'a') { |f|
      f.puts links[e]["href"]
    }
    e += 1
  end
end

The problem

The fragment part of a URI should not be sent to the server.

From Wikipedia: Fragment Identifier

The fragment identifier functions differently than the rest of the URI: namely, its processing is exclusively client-side with no participation from the web server — of course the server typically helps to determine the MIME type, and the MIME type determines the processing of fragments. When an agent (such as a Web browser) requests a web resource from a Web server, the agent sends the URI to the server, but does not send the fragment. Instead, the agent waits for the server to send the resource, and then the agent processes the resource according to the document type and fragment value.
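To make the quoted behaviour concrete, here is a small sketch using Ruby's standard uri library. The URLs are made up for illustration. It shows that the request target built from a parsed URI leaves the fragment out, and that escaping # as %23 does not remove anything: the escaped text simply becomes part of the path the server is asked for.

```ruby
require "uri"

# Hypothetical URL, used only for illustration.
with_fragment = URI.parse("http://example.com/page.html?id=1#section")

# request_uri is the path plus query; the fragment is kept separately.
request_target = with_fragment.request_uri
fragment = with_fragment.fragment

# Escaping "#" as %23 turns it into part of the path instead of dropping it.
escaped = URI.parse("http://example.com/page.html%23section")
escaped_target = escaped.request_uri
escaped_fragment = escaped.fragment

puts request_target           # /page.html?id=1
puts fragment                 # section
puts escaped_target           # /page.html%23section
puts escaped_fragment.inspect # nil
```

A server that receives the escaped form is being asked for a resource whose name literally contains the fragment text, which would be consistent with the 404 described in the question.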

Solution

Strip the fragment part of the URI before passing it to open:

require "uri"

u = URI.parse "http://example.com#fragment"
u.fragment = nil
u.to_s #=> "http://example.com"

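You're 90% of the way there. Handling the fragment is the client's responsibility.

Your code already uses URI to parse the string, so have the parsed object drop the fragment: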

require 'open-uri'
uri = URI.parse('http://foo.com/index.html#bar')
uri # => #<URI::HTTP http://foo.com/index.html#bar>
uri.fragment = nil
uri # => #<URI::HTTP http://foo.com/index.html>
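Applied to the loop in the question, the same idea might look like the sketch below. The helper name fetchable_uri is hypothetical, and the sketch assumes each input line is already a valid, escaped URL, so it skips URI.encode (which was removed in Ruby 3.0).

```ruby
require "uri"

# Hypothetical helper: parse one line of the input file and drop the fragment.
# Assumes the line is already a valid URL; URI.parse raises
# URI::InvalidURIError otherwise.
def fetchable_uri(line)
  uri = URI.parse(line.strip)
  uri.fragment = nil
  uri
end

cleaned = fetchable_uri("http://foo.com/index.html#bar\n")
puts cleaned # http://foo.com/index.html

# In the question's loop, the cleaned URI could then be handed to open-uri.
# On current Ruby versions that means URI.open rather than Kernel#open:
#   page = Nokogiri::HTML(URI.open(fetchable_uri(line)))
```

Keeping the cleanup in one small method leaves the rest of the loop unchanged, and the original line can still be written to output.txt with its fragment if that is useful for reference.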