Parsing a "hierarchical" URL with regexes if there are splits in parsing logic
Is there a way to make the rest of a regex pattern conditional on what has already been matched? A rough sketch to illustrate the idea:
               pattern
             /    |    \
            /     |     \
     prefix1   prefix2   prefix3
        |         |         |
    postfix1  postfix2  postfix3
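This is a rather theoretical question; the practical application below is for illustration only.
I am trying to find the first URL of a popular code-hosting platform (GitHub, GitLab, etc.) in a large text. The problem is that each platform has a different URL pattern: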
github.com/<user>/<repo>
gitlab.com/<group1>/<group2>/.../<repo>
sourceforge.net/projects/<repo>
I could use lookbehind expressions, but the expression becomes pretty horrible (Python re):
import re

pattern = re.compile(
    r"(github\.com|bitbucket\.org|gitlab\.com|sourceforge\.net)/"
    # middle part - empty for all except sourceforge
    r"(?:(?<=github\.com/)|(?<=bitbucket\.org/)|(?<=gitlab\.com/)|"
    r"(?<=sourceforge\.net/)projects/)("
    # final part, the repository pattern
    r"(?<=github\.com/)[a-zA-Z0-9_.-]+/[a-zA-Z0-9_.-]+|"
    r"(?<=bitbucket\.org/)[a-zA-Z0-9_.-]+/[a-zA-Z0-9_.-]+|"
    r"(?<=gitlab\.com/)[a-zA-Z0-9_.-]+(?:/[a-zA-Z0-9_.-]+)+|"
    r"(?<=sourceforge\.net/projects/)[a-zA-Z0-9_.-]+"
    r")")
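Is there a more elegant way to do something like this?
The best way is probably to use a custom parser and parse in a state-machine fashion: first determine the site, then take a site-specific path: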
import re
import urllib.parse

patterns = {
    'github.com': r'/(?P<user>[^/]+)/(?P<project>[^/#]+)(?:[/#]|$)',
    'sourceforge.net': r'/projects/(?P<project>[^/]+)/',
    # ... etc. for the other sites
}

pr = urllib.parse.urlparse(url)  # `url` must include a scheme (e.g. https://), or hostname is None
site = pr.hostname  # hostname rather than netloc, in case a port is specified
parts = re.match(patterns[site], pr.path).groupdict()
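As a minimal runnable sketch of the same dispatch idea, the lookup can be wrapped in a function that returns None for unknown hosts or non-matching paths instead of raising. The names `SITE_PATTERNS` and `parse_repo_url` are hypothetical, and the SourceForge pattern here is loosened to accept a missing trailing slash, which is an assumption rather than part of the snippet above.

```python
import re
from urllib.parse import urlparse

# Hypothetical per-site path patterns; only two sites are filled in here.
SITE_PATTERNS = {
    "github.com": re.compile(r"/(?P<user>[^/]+)/(?P<project>[^/#]+)(?:[/#]|$)"),
    # Assumption: the trailing slash after the project name is optional.
    "sourceforge.net": re.compile(r"/projects/(?P<project>[^/]+)(?:/|$)"),
}


def parse_repo_url(url):
    """Return the named groups for a known host, or None if nothing matches."""
    parsed = urlparse(url)
    # hostname is None when the URL has no scheme, so .get() also covers that case
    pattern = SITE_PATTERNS.get(parsed.hostname)
    if pattern is None:
        return None
    m = pattern.match(parsed.path)
    return m.groupdict() if m else None


print(parse_repo_url("https://github.com/python/cpython/issues/1"))
print(parse_repo_url("https://sourceforge.net/projects/sevenzip/"))
print(parse_repo_url("https://example.com/a/b"))
```

Precompiling the patterns and using `dict.get` keeps the site decision and the path decision as two separate, individually testable steps.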
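The path can also be parsed with a state machine instead of a regex, which may be more manageable if there are further splits down the line:
(An enum is generally recommended over magic strings for states; I use magic strings here only to keep the sample code simple.)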
import argparse

def parse_github(path):
    r = argparse.Namespace()
    pp = path.split('/')
    p = pp.pop(0)
    assert p == ''  # the path is expected to start with '/'
    state = 'user'
    for p in pp:    # we don't need to backtrack in this case,
                    # so `for` is a fitting mechanism to iterate
                    # over the parts.
                    # if we needed to backtrack, we'd have to use
                    # an index variable or a stack or something
        if state == 'user':
            r.user = p
            state = 'project'
        elif state == 'project':
            r.project = p
            state = 'kind'
        elif state == 'kind':
            if p in {'pull', 'commit', 'blob'}:
                state = p
            else:
                break  # end parsing, ignore anything that's left
        elif state == 'pull':
            r.pr = p
            state = 'pr_tab'
        # ... etc. for the remaining states
    return r
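For comparison, here is a sketch of the enum-based variant mentioned in the note before the function. It is only an illustration: the `State` members, the `parse_github_path` name, and the choice to stop after the pull-request number are assumptions, since the original states beyond `pr_tab` are elided.

```python
from enum import Enum, auto
from types import SimpleNamespace


class State(Enum):
    USER = auto()
    PROJECT = auto()
    KIND = auto()
    PULL = auto()
    DONE = auto()


def parse_github_path(path):
    """Walk the path segments of a GitHub-style URL, one state per segment."""
    result = SimpleNamespace(user=None, project=None, kind=None, pr=None)
    segments = [s for s in path.split("/") if s]  # drop empty segments
    state = State.USER
    for seg in segments:
        if state is State.USER:
            result.user = seg
            state = State.PROJECT
        elif state is State.PROJECT:
            result.project = seg
            state = State.KIND
        elif state is State.KIND:
            if seg not in {"pull", "commit", "blob"}:
                break  # unknown kind: stop parsing, ignore the rest
            result.kind = seg
            # Assumption: only pull requests get a further state in this sketch.
            state = State.PULL if seg == "pull" else State.DONE
        elif state is State.PULL:
            result.pr = seg
            state = State.DONE
        else:
            break  # State.DONE: nothing more to extract
    return result


print(parse_github_path("/python/cpython/pull/123/files"))
```

With an enum, a misspelled state name fails immediately with an AttributeError instead of silently never matching a branch.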
In principle, there is no recursive structure here, so this can be done with a regex alone, but it is awkward:
site_patterns = [
    r"(github\.com/)[a-zA-Z0-9_.-]+/[a-zA-Z0-9_.-]+",
    r"(bitbucket\.org/)[a-zA-Z0-9_.-]+/[a-zA-Z0-9_.-]+",
    r"(gitlab\.com/)[a-zA-Z0-9_.-]+(?:/[a-zA-Z0-9_.-]+)+",
    r"(sourceforge\.net/projects/)[a-zA-Z0-9_.-]+",
    # ... etc. for the other sites
]
r_all = re.compile("(" + "|".join(site_patterns) + ")")  # good luck debugging this monster
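If the single-regex route is taken anyway, one way to make it easier to work with is to give each site its own named group and read `Match.lastgroup` to learn which alternative matched. The sketch below is a variation on the list above, not a drop-in replacement: the group names, the `SEG` helper, and `first_repo_url` are hypothetical, and the inner groups are kept non-capturing so that `lastgroup` always names the site.

```python
import re

SEG = r"[a-zA-Z0-9_.-]+"  # one path segment, same character class as above

# Hypothetical site names mapped to their full URL patterns.
SITE_PATTERNS = {
    "github": rf"github\.com/{SEG}/{SEG}",
    "bitbucket": rf"bitbucket\.org/{SEG}/{SEG}",
    "gitlab": rf"gitlab\.com/{SEG}(?:/{SEG})+",
    "sourceforge": rf"sourceforge\.net/projects/{SEG}",
}

# One named group per site, joined into a single alternation.
R_ALL = re.compile(
    "|".join(f"(?P<{name}>{pat})" for name, pat in SITE_PATTERNS.items())
)


def first_repo_url(text):
    """Return (site, matched_url) for the leftmost match in text, or None."""
    m = R_ALL.search(text)
    if m is None:
        return None
    return m.lastgroup, m.group(m.lastgroup)


print(first_repo_url("see sourceforge.net/projects/sevenzip and github.com/a/b"))
```

Because the character class includes `.`, a URL followed directly by sentence punctuation will swallow the trailing period; that limitation is inherited from the patterns in the question.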