Why does urllib.parse not split the URL <scheme>:<number> correctly in all cases?

Question

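If I pass in a URL of the form <scheme>:<integer>, neither urlparse nor urlsplit splits off the scheme correctly, and whether it works depends on which scheme is used. If I change <integer> by adding a non-digit character, it works as expected. (I am on Python 3.8.8.)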

>>> from urllib.parse import urlparse
>>> urlparse("custom:12345")  # does not work
ParseResult(scheme='', netloc='', path='custom:12345', params='', query='', fragment='')
>>> urlparse("zip:12345")  # does not work
ParseResult(scheme='', netloc='', path='zip:12345', params='', query='', fragment='')
>>> urlparse("custom:12345d")  # this works as expected
ParseResult(scheme='custom', netloc='', path='12345d', params='', query='', fragment='')
>>> urlparse("custom:12345.")  # so does this
ParseResult(scheme='custom', netloc='', path='12345.', params='', query='', fragment='')
>>> urlparse("http:12345")  # for some reason this works (!?)
ParseResult(scheme='http', netloc='', path='12345', params='', query='', fragment='')
>>> urlparse("https:12345") # yet this does not
ParseResult(scheme='', netloc='', path='https:12345', params='', query='', fragment='')
>>> urlparse("ftp:12345")  # no luck here either
ParseResult(scheme='', netloc='', path='ftp:12345', params='', query='', fragment='')

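According to Wikipedia, a URI requires a scheme. An empty scheme should correspond to a URI reference, and <scheme>:<number> should only be treated as a schemeless (relative) path containing a colon if it is preceded by ./.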

So why does it break as demonstrated above? I would expect every case above to be split as <scheme>:<number>, with <number> as the path.

Answer 1

You see a different result when the path contains a non-digit character because of this section of the urllib.parse source:

# make sure "url" is not actually a port number (in which case
# "scheme" is really part of the path)
rest = url[i+1:]
if not rest or any(c not in '0123456789' for c in rest):
    # not a port number
    scheme, url = url[:i].lower(), rest

In Python 3.8, if the input has the form "<stuff>:<numbers>", the numbers are assumed to be a port, in which case stuff is not treated as a scheme and the whole string ends up in the path.
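
To make that concrete, here is a minimal sketch that models only the scheme-splitting step, built from the snippet above plus the 'http' special case mentioned further down. The function name split_scheme_py38 and the SCHEME_CHARS constant are my own, not part of urllib.parse, and the real function does a lot more than this (netloc, query, fragment, caching).

```python
import string

# Characters allowed in a scheme name (assumed to match what urllib.parse accepts).
SCHEME_CHARS = string.ascii_letters + string.digits + "+-."


def split_scheme_py38(url):
    """Simplified model of how Python 3.8 decides on the scheme.

    Returns (scheme, rest). This is an illustration, not the library code.
    """
    i = url.find(":")
    if i > 0:
        if url[:i] == "http":
            # Special-cased branch: 'http' skips the port-number check.
            return "http", url[i + 1:]
        if all(c in SCHEME_CHARS for c in url[:i]):
            rest = url[i + 1:]
            # An all-digit remainder is taken to be a port number,
            # so the "scheme" is left as part of the path.
            if not rest or any(c not in "0123456789" for c in rest):
                return url[:i].lower(), rest
    return "", url


for example in ("custom:12345", "custom:12345d", "http:12345", "https:12345"):
    print(example, "->", split_scheme_py38(example))
```

This reproduces the pattern in the question: custom:12345 and https:12345 keep everything in the path, while custom:12345d and http:12345 get a scheme.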

This was reported as a bug and (after quite a lot of back and forth!) fixed in Python 3.9; the above was simply rewritten to:

scheme, url = url[:i].lower(), url[i+1:]

(and some special casing of url[:i] == 'http' was removed).
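
If you are stuck on an older interpreter, one possible workaround is to peel the scheme off yourself before handing the rest to urllib.parse. The sketch below is only an assumption about what you might want: split_scheme is a hypothetical helper, and it accepts any prefix that matches the RFC 3986 scheme grammar (a letter followed by letters, digits, "+", "-" or ".").

```python
import string

# RFC 3986: scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
_SCHEME_CHARS = frozenset(string.ascii_letters + string.digits + "+-.")


def split_scheme(url):
    """Return (scheme, rest), independent of the Python version.

    Hypothetical helper: any syntactically valid '<scheme>:' prefix is
    treated as a scheme, even when the remainder is all digits.
    """
    head, sep, rest = url.partition(":")
    if (
        sep
        and head
        and head[0] in string.ascii_letters
        and all(c in _SCHEME_CHARS for c in head)
    ):
        return head.lower(), rest
    # No valid scheme prefix: treat the whole string as a relative reference.
    return "", url


print(split_scheme("custom:12345"))    # ('custom', '12345')
print(split_scheme("./custom:12345"))  # ('', './custom:12345')
```

The second call shows the ./ case from the question: the slash before the colon means the prefix is not a valid scheme, so the string stays a relative path.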
