Multiple captures within a string
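I haven't found a Q/A that answers this case well. I have applied some of the solutions I did find, and they got me this far.
I am parsing the header (metadata) section of VCF files. Each line has the format:
##TAG=<key=val,key=val,...>
I have a regex that parses the multiple k-v pairs inside the <>, but I can't seem to add the <> to the pattern and have it still "work".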
import re
s = 'a=1,b=two,c="three"'
pat = re.compile(r'''(?P<key>\w+)=(?P<value>[^,]*),?''')
match = pat.findall(s)
print(dict(match))
#{'a': '1', 'b': 'two', 'c': '"three"'}
Wrapping the pattern in a non-capturing group gives the same result:
s = 'a=1,b=two,c="three"'
pat = re.compile(r'''(?:(?P<key>\w+)=(?P<value>[^,]*),?)''')
match = pat.findall(s)
print(match)
print(dict(match))
#[('a', '1'), ('b', 'two'), ('c', '"three"')]
#{'a': '1', 'b': 'two', 'c': '"three"'}
So I thought I could do:
s = '<a=1,b=two,c="three">'
pat = re.compile(r'''<(?:(?P<key>\w+)=(?P<value>[^,]*),?)>''')
match = pat.findall(s)
print(match)
print(dict(match))
#[]
#{}
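A short sketch of why this attempt prints an empty list, using only the standard `re` module. The pattern allows exactly one key=value pair between `<` and `>`, so a string with three pairs never matches. Adding `*` to the group makes it match, but `re` keeps only the last capture of a repeated group. The names `one_pair` and `repeated` are illustrative, and the value class is tightened to `[^,>]*` here as an assumption so that a value cannot swallow the closing `>`.

```python
import re

s = '<a=1,b=two,c="three">'

# Original attempt: exactly one key=value pair must sit between < and >.
one_pair = re.compile(r'<(?:(?P<key>\w+)=(?P<value>[^,]*),?)>')
print(one_pair.findall(s))  # []

# Repeating the group lets the whole string match ...
repeated = re.compile(r'<(?:(?P<key>\w+)=(?P<value>[^,>]*),?)*>')
m = repeated.search(s)

# ... but re overwrites the named groups on every iteration,
# so only the final pair survives.
print(m.group('key'), m.group('value'))  # c "three"
print(repeated.findall(s))               # [('c', '"three"')]
```

So with `re` alone, a single pattern cannot return every pair from inside the brackets; the pairs have to be extracted in a separate pass, or with a regex engine that records all captures.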
If possible, I would really like to do something like this:
\#\#(?P<tag>)=<(?:(?P<key>\w+)=(?P<value>[^,]*),?)>
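and capture the TAG and all the k-v pairs. Obviously, I would like it to "work".
I realize the "right" solution here is probably a parser rather than a regex. But I am a bioinformatics person, not a programmer. The format is very consistent and is laid out according to a standardized spec that is (almost) always followed.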
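import regex  # third-party PyPI module (pip install regex); unlike re, it keeps every capture of a repeated group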
s = '##TAG=<key=val,key2=val2>'
pat = regex.compile(r'''##(?P<tag>\w+)=<(?:(?P<key>\w+)=(?P<value>[^,<>]*),?)*>''')
match = pat.search(s)
print([match.group("tag"), list(zip(match.captures("key"), match.captures("value")))])
Regex explanation:
##                  '##'
(?P<tag>            group and capture to \k<tag>:
  \w+                 word characters (a-z, A-Z, 0-9, _) (1 or more times (matching the most amount possible))
)                   end of \k<tag>
=<                  '=<'
(?:                 group, but do not capture (0 or more times (matching the most amount possible)):
  (?P<key>            group and capture to \k<key>:
    \w+                 word characters (a-z, A-Z, 0-9, _) (1 or more times (matching the most amount possible))
  )                   end of \k<key>
  =                   '='
  (?P<value>          group and capture to \k<value>:
    [^,<>]*             any character except: ',', '<', '>' (0 or more times (matching the most amount possible))
  )                   end of \k<value>
  ,?                  ',' (optional (matching the most amount possible))
)*                  end of grouping
>                   '>'
Result: ['TAG', [('key', 'val'), ('key2', 'val2')]]
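If installing a third-party module is not an option, roughly the same result can be sketched with the standard `re` module in two steps: match the tag and the bracketed body first, then run `findall` over the body. This is an alternative to the approach above, not a replacement for it. The names `LINE`, `PAIR` and `parse_header_line` are hypothetical, and the quoted-value branch is an assumption added because VCF `Description` values are quoted and may contain commas.

```python
import re

# Step 1: split the line into the tag and the text between < and >.
LINE = re.compile(r'##(?P<tag>\w+)=<(?P<body>.*)>\s*$')

# Step 2: pull out each key=value pair. A value is either a double-quoted
# string (which may contain commas) or a run of non-comma characters.
PAIR = re.compile(r'(?P<key>\w+)=(?P<value>"[^"]*"|[^,]*)')


def parse_header_line(line):
    """Return (tag, {key: value}), or None if the line has no <...> body."""
    m = LINE.match(line)
    if m is None:
        return None
    return m.group('tag'), dict(PAIR.findall(m.group('body')))


print(parse_header_line('##TAG=<key=val,key2=val2>'))
# ('TAG', {'key': 'val', 'key2': 'val2'})

print(parse_header_line('##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth, raw">'))
```

Unstructured header lines such as `##fileformat=VCFv4.2` have no `<...>` body, so the function returns None for them and the caller can handle those separately. Quoted values keep their surrounding double quotes, as in the original examples.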