How to do XML Parsing on GENIA corpus in Python

Question

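I have the following XML schema. I want to parse it to get all the full sentences in one list and all the text between the cons tags in another list.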

<article>
<articleinfo>
<bibliomisc>MEDLINE:95369245</bibliomisc>
</articleinfo>
<title>
<sentence><cons lex="IL-2_gene_expression" sem="G#other_name"><cons lex="IL-2_gene" sem="G#DNA_domain_or_region">IL-2 gene</cons> expression</cons> and <cons lex="NF-kappa_B_activation" sem="G#other_name"><cons lex="NF-kappa_B" sem="G#protein_molecule">NF-kappa B</cons> activation</cons> through <cons lex="CD28" sem="G#protein_molecule">CD28</cons> requires reactive oxygen production by <cons lex="5-lipoxygenase" sem="G#protein_molecule">5-lipoxygenase</cons>.</sentence>
</title>
<abstract>
<sentence>Activation of the <cons lex="CD28_surface_receptor" sem="G#protein_family_or_group"><cons lex="CD28" sem="G#protein_molecule">CD28</cons> surface receptor</cons> provides a major costimulatory signal for <cons lex="T_cell_activation" sem="G#other_name">T cell activation</cons> resulting in enhanced production of <cons lex="interleukin-2" sem="G#protein_molecule">interleukin-2</cons> (<cons lex="IL-2" sem="G#protein_molecule">IL-2</cons>) and <cons lex="cell_proliferation" sem="G#other_name">cell proliferation</cons>.</sentence>
<sentence>In <cons lex="primary_T_lymphocyte" sem="G#cell_type">primary T lymphocytes</cons> we show that <cons lex="CD28" sem="G#protein_molecule">CD28</cons> ligation leads to the rapid intracellular formation of <cons lex="reactive_oxygen_intermediate" sem="G#inorganic">reactive oxygen intermediates</cons> (<cons lex="ROI" sem="G#inorganic">ROIs</cons>) which are required for <cons lex="CD28-mediated_activation" sem="G#other_name"><cons lex="CD28" sem="G#protein_molecule">CD28</cons>-mediated activation</cons> of the <cons lex="NF-kappa_B" sem="G#protein_molecule">NF-kappa B</cons>/<cons lex="CD28-responsive_complex" sem="G#protein_complex"><cons lex="CD28" sem="G#protein_molecule">CD28</cons>-responsive complex</cons> and <cons lex="IL-2_expression" sem="G#other_name"><cons lex="IL-2" sem="G#protein_molecule">IL-2</cons> expression</cons>.</sentence>
<sentence>Delineation of the <cons lex="CD28_signaling_cascade" sem="G#other_name"><cons lex="CD28" sem="G#protein_molecule">CD28</cons> signaling cascade</cons> was found to involve <cons lex="protein_tyrosine_kinase_activity" sem="G#other_name"><cons lex="protein_tyrosine_kinase" sem="G#protein_family_or_group">protein tyrosine kinase</cons> activity</cons>, followed by the activation of <cons lex="phospholipase_A2" sem="G#protein_molecule">phospholipase A2</cons> and <cons lex="5-lipoxygenase" sem="G#protein_molecule">5-lipoxygenase</cons>.</sentence>
<sentence>Our data suggest that <cons lex="lipoxygenase_metabolite" sem="G#protein_family_or_group"><cons lex="lipoxygenase" sem="G#protein_molecule">lipoxygenase</cons> metabolites</cons> activate <cons lex="ROI_formation" sem="G#other_name"><cons lex="ROI" sem="G#inorganic">ROI</cons> formation</cons> which then induce <cons lex="IL-2" sem="G#protein_molecule">IL-2</cons> expression via <cons lex="NF-kappa_B_activation" sem="G#other_name"><cons lex="NF-kappa_B" sem="G#protein_molecule">NF-kappa B</cons> activation</cons>.</sentence>
<sentence>These findings should be useful for <cons lex="therapeutic_strategies" sem="G#other_name">therapeutic strategies</cons> and the development of <cons lex="immunosuppressants" sem="G#other_name">immunosuppressants</cons> targeting the <cons lex="CD28_costimulatory_pathway" sem="G#other_name"><cons lex="CD28" sem="G#protein_molecule">CD28</cons> costimulatory pathway</cons>.</sentence>
</abstract>
</article>
</set>

I tried this:


import xml.etree.ElementTree as ET

root = ET.parse("test.xml").getroot()

sent = [elem.text for elem in root.iter('sentence')]
print(sent)

terms = [elem.text for elem in root.iter('cons')]
print(terms)

But this gives the following output:

[None, 'Activation of the ', 'In ', 'Delineation of the ', 'Our data suggest that ', 'These findings should be useful for ']
[None, 'IL-2 gene', None, 'NF-kappa B', 'CD28', '5-lipoxygenase', None, 'CD28', 'T cell activation', 'interleukin-2', 'IL-2', 'cell proliferation', 'primary T lymphocytes', 'CD28', 'reactive oxygen intermediates', 'ROIs', None, 'CD28', 'NF-kappa B', None, 'CD28', None, 'IL-2', None, 'CD28', None, 'protein tyrosine kinase', 'phospholipase A2', '5-lipoxygenase', None, 'lipoxygenase', None, 'ROI', 'IL-2', None, 'NF-kappa B', 'therapeutic strategies', 'immunosuppressants', None, 'CD28']

I want output closer to the following:

['IL-2 gene expression and NF-kappa B activation through CD28 requires oxygen production by 5-lipoxygenase', ...]
['IL-2 gene','NF-kappa B', 'CD28', '5-lipoxygenase',...]

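The terms list in my output looks mostly fine, but how do I get complete sentences in my sent list instead of the broken fragments I'm currently getting?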

Answer 1

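The tricky part is that some of the text in the XML isn't stored in .text; it's stored in .tail. An element's .text only covers the text before its first child, and whatever follows a child's closing tag is attached to that child as its .tail.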
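To make the .text/.tail split concrete, here is a minimal sketch. The inline one-sentence string is an illustrative assumption shaped like the corpus markup, not a line taken from test.xml.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment in the same shape as the GENIA markup (assumption).
snippet = '<sentence>Activation of <cons lex="CD28">CD28</cons> is required.</sentence>'
sentence = ET.fromstring(snippet)
cons = sentence.find('cons')

# .text holds only the text that comes before the first child element.
sentence_text = sentence.text
# Text after a child's closing tag is stored on that child as .tail.
cons_tail = cons.tail

print(repr(sentence_text), repr(cons.text), repr(cons_tail))
```

This is why elem.text on a sentence returns only the leading fragment: the rest of the sentence is spread across the .text and .tail of its descendants.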

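For the sentences, it's easy to do something like this, since itertext() yields every piece of text inside the element, tails of children included: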

sent = [''.join(elem.itertext()) for elem in root.iter('sentence')]

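For the terms (the cons elements), it's a little different, because it looks like you want to skip the cons elements that merely wrap a child cons. Those wrappers open directly with the child element, so their own .text is None, and the words after the child live in the child's .tail. (What you really want is the .text of the inner cons.)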

In that case, just grab .text whenever it isn't None:

terms = [elem.text for elem in root.iter('cons') if elem.text]
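As a rough illustration of why the filter works, the sketch below uses a hypothetical nested fragment (an assumption modelled on the title sentence) and also shows an alternative, not requested in the question, that keeps the full text of the wrapping cons via itertext().

```python
import xml.etree.ElementTree as ET

# Hypothetical nested fragment modelled on the corpus markup (assumption).
snippet = ('<sentence><cons lex="IL-2_gene_expression">'
           '<cons lex="IL-2_gene">IL-2 gene</cons> expression</cons>'
           ' is induced.</sentence>')
sentence = ET.fromstring(snippet)

# The outer cons starts directly with a child, so its .text is None.
raw_texts = [elem.text for elem in sentence.iter('cons')]

# Filtering out the None values keeps only the inner term.
inner_terms = [text for text in raw_texts if text]

# Alternative: join all nested text to get the full span of every cons.
full_terms = [''.join(elem.itertext()) for elem in sentence.iter('cons')]

print(raw_texts, inner_terms, full_terms)
```

One caveat worth keeping in mind: if a cons had some text before a nested child, the filter would keep that leading fragment only, so the itertext() variant may be the safer choice for such data.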

Complete example:

import xml.etree.ElementTree as ET

tree = ET.parse('test.xml')

sent = [''.join(elem.itertext()) for elem in tree.iter('sentence')]
print(sent)

terms = [elem.text for elem in tree.iter('cons') if elem.text]
print(terms)

This prints:

['IL-2 gene expression and NF-kappa B activation through CD28 requires reactive oxygen production by 5-lipoxygenase.', 'Activation of the CD28 surface receptor provides a major costimulatory signal for T cell activation resulting in enhanced production of interleukin-2 (IL-2) and cell proliferation.', 'In primary T lymphocytes we show that CD28 ligation leads to the rapid intracellular formation of reactive oxygen intermediates (ROIs) which are required for CD28-mediated activation of the NF-kappa B/CD28-responsive complex and IL-2 expression.', 'Delineation of the CD28 signaling cascade was found to involve protein tyrosine kinase activity, followed by the activation of phospholipase A2 and 5-lipoxygenase.', 'Our data suggest that lipoxygenase metabolites activate ROI formation which then induce IL-2 expression via NF-kappa B activation.', 'These findings should be useful for therapeutic strategies and the development of immunosuppressants targeting the CD28 costimulatory pathway.']
['IL-2 gene', 'NF-kappa B', 'CD28', '5-lipoxygenase', 'CD28', 'T cell activation', 'interleukin-2', 'IL-2', 'cell proliferation', 'primary T lymphocytes', 'CD28', 'reactive oxygen intermediates', 'ROIs', 'CD28', 'NF-kappa B', 'CD28', 'IL-2', 'CD28', 'protein tyrosine kinase', 'phospholipase A2', '5-lipoxygenase', 'lipoxygenase', 'ROI', 'IL-2', 'NF-kappa B', 'therapeutic strategies', 'immunosuppressants', 'CD28']

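Note: terms will contain duplicates. If you need to remove them, there are a few different ways to do it. For example, using set() (which does not preserve the original order):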

terms = list(set(elem.text for elem in tree.iter('cons') if elem.text))
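If the order of first appearance matters, one possible alternative is dict.fromkeys(), which keeps insertion order on Python 3.7 and later. The short terms list below is a hand-picked sample for illustration, not the full output.

```python
# Sample of extracted terms, with repeats (illustrative subset, an assumption).
terms = ['IL-2 gene', 'NF-kappa B', 'CD28', '5-lipoxygenase',
         'CD28', 'IL-2', 'NF-kappa B']

# dict keys are unique and keep insertion order, so this drops repeats
# while preserving the order in which each term first appeared.
unique_terms = list(dict.fromkeys(terms))

print(unique_terms)
```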
