Python re can't find this named group
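I am trying to write a checker that suggests fixes for the format of references in papers. For example, for a dissertation the format is: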
author. dissertation name[D]. place where store it: organization who hold the copy, year in which the dissertation published.
Obviously, every item except the year may contain punctuation. For example:
Smith. The paper name. The subtitle of paper[D]. United States: MIT, 2011
The place where it is stored and the year are often missing, for example:
Smith. The paper name. The subtitle of paper[D]. US, 2011
Smith. The paper name. The subtitle of paper[D]. US: MIT
I tried to program it like this:
import re
reObj = re.compile(
r'.*\[D\]\. \s* ((?P<PLACE>[^:]*):){0,1} \s* (?P<HOLDER>[^:]*) (?P<YEAR>,\s*(1|2)\d{3}){0,1}',
re.VERBOSE
)
txt = '''Smith. The paper name. The subtitle of paper[D]. US: MIT, 2011
Smith. The paper name. The subtitle of paper[D]. US, 2011
Smith. The paper name. The subtitle of paper[D]. US: MIT'''.split('\n')
for i in txt:
    if reObj.search(i):
        if reObj.search(i).group('PLACE')==None:
            print('missing place')
        if reObj.search(i).group('YEAR')==None:
            print('missing year')
    else:
        print('bad formation')
But I found that YEAR is never captured:
for i in txt:
    print(i)
    print(reObj.search(i).group('HOLDER'))
outputs
Smith. The paper name. The subtitle of paper[D]. US: MIT, 2011
MIT, 2011
Smith. The paper name. The subtitle of paper[D]. US, 2011
US, 2011
Smith. The paper name. The subtitle of paper[D]. US: MIT
MIT
for i in txt:
    print(i)
    print(reObj.search(i).group('YEAR'))
outputs
Smith. The paper name. The subtitle of paper[D]. US: MIT, 2011
None
Smith. The paper name. The subtitle of paper[D]. US, 2011
None
Smith. The paper name. The subtitle of paper[D]. US: MIT
None
So why does my named group fail, and how can I fix it? Thanks.
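The named group itself is fine. As far as I can tell, the problem is the HOLDER class: `[^:]*` also matches commas and digits, so it greedily runs to the end of the line and the optional YEAR group is left matching nothing. A minimal sketch comparing the two behaviours (the names `greedy` and `strict` are mine, and both patterns are simplified from the question rather than copied from it):

```python
import re

line = "Smith. The paper name. The subtitle of paper[D]. US: MIT, 2011"

# HOLDER as in the question: only ':' is excluded, so ',' is allowed.
greedy = re.compile(
    r"\[D\]\.\s*(?:(?P<PLACE>[^:]*):)?\s*(?P<HOLDER>[^:]*)(?P<YEAR>,\s*[12]\d{3})?"
)

# HOLDER with ',' excluded as well, so it has to stop before ", 2011".
strict = re.compile(
    r"\[D\]\.\s*(?:(?P<PLACE>[^,:]*):)?\s*(?P<HOLDER>[^,:]*)(?:,\s*(?P<YEAR>[12]\d{3}))?"
)

g = greedy.search(line)
s = strict.search(line)

print(repr(g.group("HOLDER")), g.group("YEAR"))  # 'MIT, 2011' None
print(repr(s.group("HOLDER")), s.group("YEAR"))  # 'MIT' 2011
```

An optional group that can match zero times will never force an earlier greedy group to give characters back, which is why YEAR stays None in the first pattern.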
I think you can use
reObj = re.compile(r"""
    \[D\]\. \s*                        # [D]. and 0+ whitespaces
    (?:                                # An optional group
        (?P<PLACE>[^,:]*)              # Group "PLACE": 0+ chars other than , and :
        (?:                            # An optional sequence of
            : \s* (?P<HOLDER>[^,:]*)   # :, 0+ whitespaces, Group "HOLDER" (0+ chars other than , and :)
        )?
        (?:                            # An optional sequence of
            ,\s* (?P<YEAR>[12]\d{3})   # , + 0+ whitespaces, Group "YEAR" (1 or 2, then three digits)
        )?
    )?
    $                                  # end of string
    """, flags=re.X)
See the regex and Python demos:
import re
reObj = re.compile(
r"\[D\]\.\s*(?:(?P<PLACE>[^,:]*)(?::\s*(?P<HOLDER>[^,:]*))?(?:,\s*(?P<YEAR>[12]\d{3}))?)?$",
re.VERBOSE
)
txt = '''Smith. The paper name. The subtitle of paper[D]. US: MIT, 2011
Smith. The paper name. The subtitle of paper[D]. US, 2011
Smith. The paper name. The subtitle of paper[D]. US: MIT'''.split('\n')
for i in txt:
    print('------------------------\nTESTING {}'.format(i))
    m = reObj.search(i)
    if m:
        if not m.group('PLACE'):
            print('missing place')
        else:
            print(m.group('PLACE'))
        if not m.group('YEAR'):
            print('missing year')
        else:
            print(m.group('YEAR'))
Output:
------------------------
TESTING Smith. The paper name. The subtitle of paper[D]. US: MIT, 2011
US
2011
------------------------
TESTING Smith. The paper name. The subtitle of paper[D]. US, 2011
US
2011
------------------------
TESTING Smith. The paper name. The subtitle of paper[D]. US: MIT
US
missing year
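If the goal is a checker rather than a printout, the same pattern could be wrapped in a small helper. This is only a sketch: `REF_RE` and `missing_fields` are hypothetical names, not part of the question. Note also that the pattern always assigns the first item after `[D].` to PLACE, so an entry such as `MIT, 2011` would be reported as missing HOLDER rather than missing PLACE.

```python
import re

# Same pattern as in the answer above; PLACE, HOLDER and YEAR are all optional.
REF_RE = re.compile(
    r"\[D\]\.\s*(?:(?P<PLACE>[^,:]*)(?::\s*(?P<HOLDER>[^,:]*))?(?:,\s*(?P<YEAR>[12]\d{3}))?)?$"
)

def missing_fields(reference):
    """Return the names of empty fields, or None if the tail after [D]. does not parse."""
    m = REF_RE.search(reference)
    if m is None:
        return None  # corresponds to 'bad formation' in the question
    # A group that did not participate is None; an empty capture is ''. Both count as missing.
    return [name for name in ("PLACE", "HOLDER", "YEAR") if not m.group(name)]

refs = [
    "Smith. The paper name. The subtitle of paper[D]. US: MIT, 2011",
    "Smith. The paper name. The subtitle of paper[D]. US, 2011",
    "Smith. The paper name. The subtitle of paper[D]. US: MIT",
]
for ref in refs:
    print(ref, "->", missing_fields(ref))
```

Returning a list keeps the reporting separate from the matching, so the caller can decide how to phrase each suggestion.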