Python and Ruby differ when parsing a URL path: which one is valid?

I have a URL string:

url = "https://foo.bar.com/path/to/aaa.bbb/ccc.ddd;dc_trk_aid=486652617;tfua=;gdpr=;gdpr_consent=?&339286293"

Using Python:

from urllib.parse import urlparse

url_obj = urlparse(url)
url_obj.path  # `/path/to/aaa.bbb/ccc.ddd`

Using Ruby:

require 'uri'

url_obj = URI.parse(url)

url_obj.path # `/path/to/aaa.bbb/ccc.ddd;dc_trk_aid=486652617;tfua=;gdpr=;gdpr_consent=`

I guess Python does not treat ; as part of the URL path. Which one is 'correct'?

Python's urllib is wrong. RFC 3986, Uniform Resource Identifier (URI): Generic Syntax, Section 3.3 Path, explicitly gives this exact syntax as an example of a valid path:

Aside from dot-segments in hierarchical paths, a path segment is considered opaque by the generic syntax. URI producing applications often use the reserved characters allowed in a segment to delimit scheme-specific or dereference-handler-specific subcomponents. For example, the semicolon (";") and equals ("=") reserved characters are often used to delimit parameters and parameter values applicable to that segment. The comma (",") reserved character is often used for similar purposes. For example, one URI producer might use a segment such as "name;v=1.1" to indicate a reference to version 1.1 of "name", whereas another might use a segment such as "name,1.1" to indicate the same. Parameter types may be defined by scheme-specific semantics, but in most cases the syntax of a parameter is specific to the implementation of the URI's dereferencing algorithm.

The correct interpretation of the example URI you posted is:

  • Scheme = https
  • Authority = foo.bar.com
    • User info = empty
    • Host = foo.bar.com
    • Port = empty, derived from the scheme as 443
  • Path = /path/to/aaa.bbb/ccc.ddd;dc_trk_aid=486652617;tfua=;gdpr=;gdpr_consent=, made up of the following four path segments:
    1. path
    2. to
    3. aaa.bbb
    4. ccc.ddd;dc_trk_aid=486652617;tfua=;gdpr=;gdpr_consent=
  • Query = &339286293
  • Fragment = empty
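The breakdown above can be checked from Python with a minimal sketch like the following. It uses urlsplit rather than urlparse so the last segment keeps its parameters. The DEFAULT_PORTS table and the breakdown dict are names made up for this illustration, not part of urllib.

```python
from urllib.parse import urlsplit

url = "https://foo.bar.com/path/to/aaa.bbb/ccc.ddd;dc_trk_aid=486652617;tfua=;gdpr=;gdpr_consent=?&339286293"
parts = urlsplit(url)

# Illustrative default-port table (an assumption for this sketch, not a urllib API).
DEFAULT_PORTS = {"http": 80, "https": 443}

breakdown = {
    "scheme": parts.scheme,
    "userinfo": parts.username,  # None when the URL has no user info
    "host": parts.hostname,
    # parts.port is None when no explicit port is given, so fall back to the scheme default.
    "port": parts.port or DEFAULT_PORTS.get(parts.scheme),
    # Drop the empty string produced by the leading "/".
    "segments": parts.path.split("/")[1:],
    "query": parts.query,
    "fragment": parts.fragment,
}

for key, value in breakdown.items():
    print(f"{key:9} = {value!r}")
```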

urlparse splits off everything after the first semicolon in the last path segment as params:

url_obj.path   # '/path/to/aaa.bbb/ccc.ddd'
url_obj.params # 'dc_trk_aid=486652617;tfua=;gdpr=;gdpr_consent='
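To see that only the last segment is affected, here is a small sketch with a made-up URL (example.com and the x/y parameters are assumptions for illustration) that carries parameters on two different segments:

```python
from urllib.parse import urlparse

# Hypothetical URL with parameters on both the first and the last segment.
demo = urlparse("https://example.com/a;x=1/b;y=2")

demo_path = demo.path      # parameters on the earlier segment stay in the path
demo_params = demo.params  # only the last segment's parameters are split off

print(demo_path)    # /a;x=1/b
print(demo_params)  # y=2
```

So urlparse does not drop the semicolon-delimited data, it just moves the last segment's share of it into a separate attribute.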

To replicate Ruby's behavior, use urlsplit instead. From the Python documentation:

This is similar to urlparse(), but does not split the params from the URL. This should generally be used instead of urlparse() if the more recent URL syntax allowing parameters to be applied to each segment of the path portion of the URL (see RFC 2396) is wanted.

from urllib.parse import urlsplit

url_obj = urlsplit(url)
url_obj.path  # '/path/to/aaa.bbb/ccc.ddd;dc_trk_aid=486652617;tfua=;gdpr=;gdpr_consent='
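If you do need the per-segment parameters, one possible approach is to split them yourself on top of urlsplit. This is only a sketch: segment_params is a hypothetical helper, not something urllib provides, and it assumes the common name=value convention the RFC mentions, which is not mandated by the generic syntax.

```python
from urllib.parse import urlsplit


def segment_params(url):
    """Return a list of (name, params) pairs, one per path segment.

    Hypothetical helper: treats ';' as the parameter delimiter and '='
    as the name/value separator inside each segment.
    """
    result = []
    for segment in urlsplit(url).path.split("/")[1:]:
        name, *raw_params = segment.split(";")
        # partition("=") yields (key, "=", value); keep key and value only.
        params = dict(p.partition("=")[::2] for p in raw_params)
        result.append((name, params))
    return result


url = "https://foo.bar.com/path/to/aaa.bbb/ccc.ddd;dc_trk_aid=486652617;tfua=;gdpr=;gdpr_consent=?&339286293"
segments = segment_params(url)
for name, params in segments:
    print(name, params)
```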