How to parse a pure MEDLINE format file
I need to process a MEDLINE file with the following structure:
PMID- 1
OWN - NLM
STAT- MEDLINE
DCOM- 20121113
TI  - Formate assay in body fluids: application in methanol poisoning.

PMID- 2
OWN - NLM
STAT- MEDLINE
DCOM- 20121113
TI  - Delineation of the intimate details of the backbone conformation of pyridine
      nucleotide coenzymes in aqueous solution.

PMID- 21
OWN - NLM
STAT- MEDLINE
DCOM- 20121113
TI  - [Biochemical studies on camomile components/III. In vitro studies about the
      antipeptic activity of (--)-alpha-bisabolol (author's transl)].
AB  - (--)-alpha-Bisabolol has a primary antipeptic action depending on dosage, which
      is not caused by an alteration of the pH-value. The proteolytic activity of
      pepsin is reduced by 50 percent through addition of bisabolol in the ratio of
      1/0.5. The antipeptic action of bisabolol only occurs in case of direct contact.
      In case of a previous contact with the substrate, the inhibiting effect is lost.
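The main task is to print only the lines that belong to the PMID, TI and AB fields. I started with the script pasted below.
Question: why is the med.records object empty at the end of processing? Any ideas are appreciated.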
import re


class Medline:
    """ MEDLINE file structure """

    def __init__(self, in_file=None):
        """ Initialize and parse input """
        self.records = []
        if in_file:
            self.parse(in_file)

    def parse(self, in_file):
        """ Parse input file """
        self.current_tag = None
        self.current_record = None
        prog = re.compile("^(....)- (.*)")
        lines = []
        # Skip blank lines
        for line in in_file:
            line = line.rstrip()
            if line == "":
                continue
            if not line.startswith(" "):
                match = prog.match(line)
                if match:
                    tag = match.groups()[0]
                    field = match.groups()[1]
                    self.process_field(tag, field)

    def process_field(self, tag, field):
        """ Process MEDLINE file field """
        if tag == "PMID":
            self.current_record = {tag: field}


def main():
    """ Test the code """
    import pprint
    with open("medline_file.txt", "rt") as medline_file:
        med = Medline(medline_file)
        pp = pprint.PrettyPrinter()
        pp.pprint(med.records)


if __name__ == "__main__":
    main()
It is a simple slip. In process_field(self, tag, field) you save your tag and field in self.current_record:
self.current_record = {tag: field}
But after that you do nothing with it. In main you print the records list:
pp.pprint(med.records)
You never append anything to self.records, so of course it is empty.
One solution is:
    def process_field(self, tag, field):
        """ Process MEDLINE file field """
        if tag == "PMID":
            self.records.append({tag: field})
This will generate the output:
[{'PMID': '1'}, {'PMID': '2'}, {'PMID': '21'}]
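Also: you said the AB field is important. Keep in mind that, because of the line if not line.startswith(" "):, only the first line of AB is matched as a tag line (for example: AB - (--)-alpha-Bisabolol has a primary antipeptic action depending on dosage, which) and all the continuation lines are filtered out.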
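To actually keep PMID, TI and AB, including the wrapped lines, something along these lines may work. It is only a minimal sketch: it assumes every tag is padded to four characters before the "- " separator and that continuation lines start with whitespace, as in a standard MEDLINE export. The names parse_medline, TAG_LINE and WANTED are made up for illustration and are not part of your script.

```python
import re

# A tag line looks like "PMID- 1" or "TI  - Some title":
# a tag padded to four characters, then "- ", then the value.
TAG_LINE = re.compile(r"^(.{4})- (.*)$")
WANTED = {"PMID", "TI", "AB"}  # hypothetical default: the fields asked about


def parse_medline(lines, wanted=WANTED):
    """Return one dict per record, keeping only the wanted tags."""
    records = []
    record = None  # dict of the record currently being built
    tag = None     # tag of the most recent tag line
    for raw in lines:
        line = raw.rstrip()
        if not line:
            continue  # skip blank lines between records
        if line.startswith(" "):
            # Continuation line: glue it onto the previous field's value.
            if record is not None and tag in wanted:
                record[tag] += " " + line.strip()
            continue
        match = TAG_LINE.match(line)
        if not match:
            continue
        tag = match.group(1).strip()
        value = match.group(2)
        if tag == "PMID":
            # Assumption: PMID is the first field of every record.
            record = {}
            records.append(record)
        if record is not None and tag in wanted:
            record[tag] = value
    return records


sample = [
    "PMID- 1",
    "OWN - NLM",
    "TI  - Formate assay.",
    "",
    "PMID- 2",
    "TI  - Delineation of the",
    "      nucleotide coenzymes.",
    "AB  - First line,",
    "      second line.",
]
records = parse_medline(sample)
```

Since an open file yields lines just like the list above, you could call parse_medline(medline_file) inside your with block and pretty-print the result. Note that a tag repeated within one record would overwrite the earlier value here; that is fine for PMID, TI and AB but would need a list for fields such as AU.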