Parsing BibTeX citation format with Python
What is the best way to parse these results in Python? I tried a regex but could not get it to work. I am looking for a dictionary with title, author, etc. as keys.
@article{perry2000epidemiological,
title={An epidemiological study to establish the prevalence of urinary symptoms and felt need in the community: the Leicestershire MRC Incontinence Study},
author={Perry, Sarah and Shaw, Christine and Assassa, Philip and Dallosso, Helen and Williams, Kate and Brittain, Katherine R and Mensah, Fiona and Smith, Nigel and Clarke, Michael and Jagger, Carol and others},
journal={Journal of public health},
volume={22},
number={3},
pages={427--434},
year={2000},
publisher={Oxford University Press}
}
You can use a regular expression:
import re
s = """
@article{perry2000epidemiological,
title={An epidemiological study to establish the prevalence of urinary symptoms and felt need in the community: the Leicestershire MRC Incontinence Study},
author={Perry, Sarah and Shaw, Christine and Assassa, Philip and Dallosso, Helen and Williams, Kate and Brittain, Katherine R and Mensah, Fiona and Smith, Nigel and Clarke, Michael and Jagger, Carol and others},
journal={Journal of public health},
volume={22},
number={3},
pages={427--434},
year={2000},
publisher={Oxford University Press}
}
"""
# Raw string avoids invalid-escape warnings; '-' is included in the value
# character class so that page ranges such as 427--434 are kept whole.
results = re.findall(r'(?<=@article\{)[a-zA-Z0-9]+|(?<=\=\{)[-a-zA-Z0-9:\s,]+|[a-zA-Z]+(?=\=)|@[a-zA-Z0-9]+', s)
# Tokens alternate key, value: drop the leading '@' from the entry type and
# convert purely numeric values to int.
final_results = {results[i][1:] if results[i].startswith('@') else results[i]: int(results[i+1]) if results[i+1].isdigit() else results[i+1] for i in range(0, len(results), 2)}
print(final_results)
Output:
{'article': 'perry2000epidemiological', 'title': 'An epidemiological study to establish the prevalence of urinary symptoms and felt need in the community: the Leicestershire MRC Incontinence Study', 'author': 'Perry, Sarah and Shaw, Christine and Assassa, Philip and Dallosso, Helen and Williams, Kate and Brittain, Katherine R and Mensah, Fiona and Smith, Nigel and Clarke, Michael and Jagger, Carol and others', 'journal': 'Journal of public health', 'volume': 22, 'number': 3, 'pages': '427--434', 'year': 2000, 'publisher': 'Oxford University Press'}
You may be looking for re.split:
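import re
article_dict = {}
with open('inp.txt') as f:
    # Skip the first line (@article{...) and the last line (closing brace).
    for line in f.readlines()[1:-1]:
        # maxsplit=1 keeps any '=' inside the value intact.
        info = re.split(r'=', line.strip(), maxsplit=1)
        article_dict[info[0]] = info[1]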
I assume you will want to strip the surrounding braces and the trailing comma, which is a simple replace or slicing task.
{'title': '{An epidemiological study to establish the prevalence of urinary symptoms and felt need in the community: the Leicestershire MRC Incontinence Study},',
'author': '{Perry, Sarah and Shaw, Christine and Assassa, Philip and Dallosso, Helen and Williams, Kate and Brittain, Katherine R and Mensah, Fiona and Smith, Nigel and Clarke, Michael and Jagger, Carol and others},',
'journal': '{Journal of public health},',
'volume': '{22},',
'number': '{3},',
'pages': '{427--434},',
'year': '{2000},',
'publisher': '{Oxford University Press}'}
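The clean-up mentioned above could be sketched roughly as follows. The helper name clean_value is made up for illustration, and the sample dictionary is a shortened copy of the output shown above; it assumes each value carries at most one trailing comma and one pair of outer braces.

```python
def clean_value(raw):
    """Strip whitespace, one trailing comma, and one pair of outer braces."""
    value = raw.strip()
    if value.endswith(','):
        value = value[:-1].rstrip()
    if value.startswith('{') and value.endswith('}'):
        value = value[1:-1]
    return value


# Shortened sample of the dictionary produced by the re.split loop.
article_dict = {
    'volume': '{22},',
    'pages': '{427--434},',
    'publisher': '{Oxford University Press}',
}

cleaned = {key: clean_value(value) for key, value in article_dict.items()}
print(cleaned)
```

Slicing only the outermost pair of braces, rather than calling strip('{}'), leaves any braces inside the value untouched.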
This looks like a citation format. You could parse it like this:
>>> import re
>>> kv = re.compile(r'\b(?P<key>\w+)={(?P<value>[^}]+)}')
>>> citation = """
... @article{perry2000epidemiological,
... title={An epidemiological study to establish the prevalence of urinary symptoms and felt need in the community: the Leicestershire MRC Incontinence Study},
... author={Perry, Sarah and Shaw, Christine and Assassa, Philip and Dallosso, Helen and Williams, Kate and Brittain, Katherine R and Mensah, Fiona and Smith, Nigel and Clarke, Michael and Jagger, Carol and others},
... journal={Journal of public health},
... volume={22},
... number={3},
... pages={427--434},
... year={2000},
... publisher={Oxford University Press}
... }
... """
>>> dict(kv.findall(citation))
{'author': 'Perry, Sarah and Shaw, Christine and Assassa, Philip and Dallosso, Helen and Williams, Kate and Brittain, Katherine R and Mensah, Fiona and Smith, Nigel and Clarke, Michael and Jagger, Carol and others',
'journal': 'Journal of public health',
'number': '3',
'pages': '427--434',
'publisher': 'Oxford University Press',
'title': 'An epidemiological study to establish the prevalence of urinary symptoms and felt need in the community: the Leicestershire MRC Incontinence Study',
'volume': '22',
'year': '2000'}
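The regex uses two named capture groups (mostly to make it visually clear what is what).
- "key" is one or more Unicode word characters, with a word boundary on its left and a literal equals sign on its right;
- "value" is whatever sits between the two curly braces. As long as you do not expect "nested" curly braces, you can conveniently use [^}]. In other words, a value is simply one or more characters, none of them a closing curly brace, enclosed in curly braces.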
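If nested braces do occur (BibTeX titles often protect capitals, as in {B}ib{TeX}), the [^}] approach stops at the first closing brace. One possible workaround is to count brace depth by hand. The sketch below is only an illustration: parse_bibtex_fields is a hypothetical helper, it handles a single entry, and it assumes every field value is wrapped in braces (unbraced values such as month = jan are not supported).

```python
def parse_bibtex_fields(entry):
    """Return {field: value} for one BibTeX entry, allowing nested braces.

    Assumes every value is brace-delimited; this is a sketch, not a full parser.
    """
    fields = {}
    n = len(entry)
    i = entry.index('{') + 1        # skip past "@type{"
    i = entry.index(',', i) + 1     # skip past the citation key
    while i < n:
        eq = entry.find('=', i)
        if eq == -1:
            break                   # no more fields
        key = entry[i:eq].strip().lower()
        start = entry.find('{', eq)
        if start == -1:
            break                   # value is not brace-delimited
        # Walk forward until the brace opened at `start` is closed again.
        depth = 0
        j = start
        while j < n:
            if entry[j] == '{':
                depth += 1
            elif entry[j] == '}':
                depth -= 1
                if depth == 0:
                    break
            j += 1
        fields[key] = entry[start + 1:j]
        i = j + 1
        # Skip the separator between fields.
        while i < n and entry[i] in ', \t\r\n':
            i += 1
    return fields


sample = "@article{key2000,\n title={The {B}ib{TeX} format},\n year={2000}\n}"
print(parse_bibtex_fields(sample))
```

For anything beyond a quick script, a dedicated BibTeX parsing library is likely the more robust choice.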
You may be looking for BibTeX-parser: https://bibtexparser.readthedocs.io/en/master/
Source: https://bibtexparser.readthedocs.io/en/master/tutorial.html#step-0-vocabulary
Input / create the bibtex file:
bibtex = """@ARTICLE{Cesar2013,
author = {Jean César},
title = {An amazing title},
year = {2013},
month = jan,
volume = {12},
pages = {12--23},
journal = {Nice Journal},
abstract = {This is an abstract. This line should be long enough to test
multilines...},
comments = {A comment},
keywords = {keyword1, keyword2}
}
"""
with open('bibtex.bib', 'w') as bibfile:
    bibfile.write(bibtex)
Parse it:
import bibtexparser
with open('bibtex.bib') as bibtex_file:
    bib_database = bibtexparser.load(bibtex_file)
print(bib_database.entries)
Output:
[{'journal': 'Nice Journal',
'comments': 'A comment',
'pages': '12--23',
'month': 'jan',
'abstract': 'This is an abstract. This line should be long enough to test\nmultilines...',
'title': 'An amazing title',
'year': '2013',
'volume': '12',
'ID': 'Cesar2013',
'author': 'Jean César',
'keyword': 'keyword1, keyword2',
'ENTRYTYPE': 'article'}]
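Since the question asks for a dictionary, note that entries in the output above is a list with one plain dict per entry. A small follow-up sketch, using a shortened literal copy of that output rather than calling the library, shows one way to index the entries by citation key and split the author field. The names by_id and split_authors are illustrative only.

```python
# Shortened literal copy of the parsed output shown above.
entries = [{'journal': 'Nice Journal',
            'pages': '12--23',
            'title': 'An amazing title',
            'year': '2013',
            'ID': 'Cesar2013',
            'author': 'Jean César',
            'ENTRYTYPE': 'article'}]


def split_authors(author_field):
    """BibTeX separates multiple authors with the word 'and'."""
    return [name.strip() for name in author_field.split(' and ')]


# Index each entry by its citation key for direct lookup.
by_id = {entry['ID']: entry for entry in entries}

print(by_id['Cesar2013']['title'])
print(split_authors(by_id['Cesar2013']['author']))
print(split_authors('Perry, Sarah and Shaw, Christine'))
```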