Complex conditional regex grouping
I have the following text:
1. The SBI considered this sub-item at its resumed 1^st^ meeting and at its 2^nd^ meeting (see
para. 46 above). It had before it documents FCCC/CP/2011/7 and Corr.1 and Add.1 and 2,
FCCC/SBI/2010/17, FCCC/SBI/2010/26 and FCCC/SBI/2010/MISC.9. A statement was made on behalf of the LDCs.
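I am interested in extracting document names that follow this pattern: (FCCC\/(?:SBSTA|SBI|CP|KP\/CMP|PA\/CMA)\/[0-9]{4}\/(?:INF\.|L\.|MISC\.)?[0-9]+(?:\/Add\.[0-9])?(?:\/Rev\.[0-9]+)?)
The structure is FCCC/Document_type (SBSTA, SBI, etc.)/Year/Number, and a document may or may not have addenda, corrigenda and revisions.
There are two ways an addendum or revision can be referenced:
- appended to the end of the name: /Rev or /Add plus a number
- or listed after the name: and Rev|Add|Corr .num
I then want to build the names referenced by the second form from the text. For example, map: FCCC/CP/2011/7 and Corr.1 and Add.1 and 2
to ["FCCC/CP/2011/7", "FCCC/CP/2011/7/Corr.1", "FCCC/CP/2011/7/Add.1", "FCCC/CP/2011/7/Add.2"].
This is my current approach: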
import re
from typing import Union

def _find_documents(par: str) -> Union[list, None]:
    """
    Finds referenced documents
    :param par:
    :return:
    """
    found_list = []
    pattern = r"(FCCC\/(?:SBSTA|SBI|CP|KP\/CMP|PA\/CMA)\/[0-9]{4}\/(?:INF\.|L\.|MISC\.)?[0-9]+(?:\/Add\.[0-9])?(?:\/Rev\.[0-9]+)?)"
    found = re.findall(pattern, par)
    # Now, we look for corrections and Revisions
    for doc in found:
        found_list.append(doc)
        doc = doc.replace(r"/", r"\/")
        pattern = doc + r"(?: and ((:?Corr\.|Add\.)?[0-9]))?(?: and ((:?Corr\.|Add\.)[0-9]))*(:? and ([0-9])+)?"
        res = re.search(pattern, par).groups()
        for pat in res:
            if pat is not None:
                found_list.append(doc + "/" + pat)
    return found_list if found_list is not None else None
st = r"""
50. The SBI considered this sub-item at its resumed 1^st^ meeting and at its 2^nd^ meeting (see para. 46 above). It had before it documents FCCC/CP/2011/7 and Corr.1 and Add.1 and 2, FCCC/SBI/2010/17, FCCC/SBI/2010/26 and FCCC/SBI/2010/MISC.9. A statement was made on behalf of the LDCs.
"""
_find_documents(st)
""" [OUT]:
['FCCC/CP/2011/7',
'FCCC\/CP\/2011\/7/Corr.1',
'FCCC\/CP\/2011\/7/Corr.', EXTRA
'FCCC\/CP\/2011\/7/Add.1',
'FCCC\/CP\/2011\/7/Add.', EXTRA
'FCCC\/CP\/2011\/7/ and 2', EXTRA
'FCCC\/CP\/2011\/7/2', WRONG Should be FCCC/CP/2011/7/Add.2
'FCCC/SBI/2010/17',
'FCCC/SBI/2010/26',
'FCCC/SBI/2010/MISC.9']"""
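As you can see, I have a few problems I don't know how to solve.
- The groups produce extra matches: ["Add.", "Corr.", "and 2"]
- When I try to append the Corrs and Adds, the / somehow ends up escaped.
- I'm not sure how to map the sub-match "and 2" to /Add.2 or /Corr.2, depending on what precedes it.
Any ideas?
Thanks,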
You can use
import re
text = "50. The SBI considered this sub-item at its resumed 1^st^ meeting and at its 2^nd^ meeting (see para. 46 above). It had before it documents FCCC/CP/2011/7 and Corr.1 and Add.1 and 2, FCCC/SBI/2010/17, FCCC/SBI/2010/26 and FCCC/SBI/2010/MISC.9. A statement was made on behalf of the LDCs."
rx_main = re.compile(r'(FCCC/(?:SBSTA|SBI|CP|KP/CMP|PA/CMA)/\d{4}/(?:INF\.|L\.|MISC\.)?\d+)((?:(?:/|\s+and\s+|\s*,\s*)(?:Add|Rev|Corr)\.\d+(?:(?:\s*,\s*|\s+and\s+)\d+)*)*)')
rx_rev = re.compile(r'(?:Add|Rev|Corr)\.\d+(?:(?:\s*,\s*|\s+and\s+)\d+)*')
rx_split = re.compile(r'\s*,\s*|\s+and\s+')
matches = rx_main.finditer(text)
results = []
for m in matches:
    results.append(m.group(1))
    chunks = [rx_split.split(x) for x in rx_rev.findall(m.group(2))]
    for ch in chunks:
        if len(ch) == 1:  # it is simple, just add first item to Group 1
            results.append(f"{m.group(1)}/{ch[0]}")
        else:
            name = ch[0].split('.')[0]  # Rev, Corr or Add
            for c in ch:
                if '.' in c:  # if there is a dot, append whole string to Group 1
                    results.append(f"{m.group(1)}/{c}")
                else:
                    results.append(f"{m.group(1)}/{name}.{c}")  # Append the new number to Corr/Add/Rev
print(results)
Output:
['FCCC/CP/2011/7', 'FCCC/CP/2011/7/Corr.1', 'FCCC/CP/2011/7/Add.1', 'FCCC/CP/2011/7/Add.2', 'FCCC/SBI/2010/17', 'FCCC/SBI/2010/26', 'FCCC/SBI/2010/MISC.9']
See this Python demo.
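On the first two symptoms listed in the question, the following small sketch may help; it is an illustration only, and the name doc_pattern is hypothetical. In the question's pattern, (:? looks like a typo for (?: and it opens a capturing group that starts with an optional colon, which is why "Corr." and "Add." come back as extra groups. The escaped slashes in the output come from reassigning doc to its escaped form before appending it to the results; in Python's re module a forward slash needs no escaping, and keeping the pattern text in a separate variable leaves the original name intact.

```python
import re

# "(:?" opens a CAPTURING group whose first token is an optional colon,
# so the label is returned as an extra group.
typo = re.search(r"(:?Corr\.|Add\.)[0-9]", "Corr.1")
# "(?:" opens a non-capturing group, so nothing extra is returned.
fixed = re.search(r"(?:Corr\.|Add\.)[0-9]", "Corr.1")
print(typo.groups())   # ('Corr.',)
print(fixed.groups())  # ()

# Keep the document name untouched and build the pattern text separately
# (doc_pattern is a hypothetical name used for this illustration).
doc = "FCCC/CP/2011/7"
doc_pattern = re.escape(doc)
found = re.search(doc_pattern + r" and Corr\.[0-9]", "FCCC/CP/2011/7 and Corr.1")
print(doc)             # FCCC/CP/2011/7
print(found.group(0))  # FCCC/CP/2011/7 and Corr.1
```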
The new regex is
(FCCC/(?:SBSTA|SBI|CP|KP/CMP|PA/CMA)/\d{4}/(?:INF\.|L\.|MISC\.)?\d+)((?:(?:/|\s+and\s+|\s*,\s*)(?:Add|Rev|Corr)\.\d+(?:(?:\s*,\s*|\s+and\s+)\d+)*)*)
See the regex demo.
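To make the structure of that pattern easier to follow, here is the same expression written with re.VERBOSE and commented part by part. This is only a sketch: the name rx_main_verbose and the comment wording are assumptions added for illustration, and the pattern itself is unchanged.

```python
import re

rx_main_verbose = re.compile(r"""
    (                                      # group 1: the base document name
        FCCC/
        (?:SBSTA|SBI|CP|KP/CMP|PA/CMA)     # document type
        /\d{4}/                            # year
        (?:INF\.|L\.|MISC\.)?              # optional INF., L. or MISC. prefix
        \d+                                # document number
    )
    (                                      # group 2: every suffix reference (may be empty)
        (?:
            (?:/|\s+and\s+|\s*,\s*)        # separator: slash, the word and, or a comma
            (?:Add|Rev|Corr)\.\d+          # a label followed by its first number
            (?:(?:\s*,\s*|\s+and\s+)\d+)*  # bare numbers that reuse the previous label
        )*
    )
""", re.VERBOSE)

sample = "documents FCCC/CP/2011/7 and Corr.1 and Add.1 and 2, FCCC/SBI/2010/17"
m = rx_main_verbose.search(sample)
print(repr(m.group(1)))  # 'FCCC/CP/2011/7'
print(repr(m.group(2)))  # ' and Corr.1 and Add.1 and 2'
```

Group 2 holds the whole run of suffix references as one string, which is why the code above applies rx_rev and rx_split to it afterwards to recover the individual Corr/Add/Rev entries.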