Retrieve many nested statements at the same time for a single object with lxml Python
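I am working with a large XML file and retrieving many different attributes from it. Now I am trying to retrieve the comment category attribute and join it to the text between the tags. However, I need to handle 3 different cases. XML example: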
<comment-list>
<comment category="Derived from sampling site"> Peripheral blood </comment>
<comment category="Transformant">
<cv-term terminology="NCBI-Taxonomy" accession="10376">Epstein-Barr virus (EBV)</cv-term>
</comment>
<comment category="Sequence variation"> Hemizygous for FMR1 >200 CGG repeats (PubMed=25776194)
</comment>
<comment category="Monoclonal antibody target">
<xref-list>
<xref database="UniProtKB" category="Sequence databases" accession="Q5T5X7">
<property-list>
<property name="gene/protein designation" value="Human BEND3"/>
</property-list>
<url><![CDATA[https://www.uniprot.org/uniprot/Q5T5X7]]></url>
</xref>
</xref-list>
</comment>
</comment-list>
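- When there are no child tags under <comment>, I need to retrieve the comment category attribute and join it with the text between the tags.
- When a <cv-term> tag is nested under <comment>, I need to retrieve the comment category, the cv-term terminology and cv-term accession attributes, and the text between the cv-term tags.
- When several tags are nested under <comment> (<xref-list> - <xref> - <property-list> - <property> - <url>), I need to retrieve the comment category, the xref database attribute, the xref accession attribute, and the property value attribute.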
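I am using lxml to parse this XML, and I am struggling to work out how to handle case 2. Cases 1 and 3 work, but when a single object contains all three cases, the output gets mixed up.
I would like to get the following output: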
Derived from sampling site: Peripheral blood
Transformant: NCBI-Taxonomy, 10376, Epstein-Barr virus (EBV)
Sequence variation: Hemizygous for FMR1 >200 CGG repeats (PubMed=25776194)
Monoclonal antibody target: UniProtKB, Q5T5X7, Human BEND3
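Here is my very messy code, which outputs the elements in the wrong order. It works fine for cases 1 and 3, but once case 2 comes into play, the order of the output is wrong: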
comment_cat = att.xpath('.//comment-list/comment/@category')
comment_text = att.xpath('.//comment-list/comment/text()')
cv_term = att.xpath('.//comment-list/comment/cv-term/text()')
xref = [a + ', ' + b for a, b in zip(att.xpath('.//comment-list/comment/xref-list/xref/@database'), att.xpath('.//comment-list/comment/xref-list/xref/@accession'))]
property_list = att.xpath('.//comment-list/comment/xref-list/xref/property-list/property/@value')
xref_property_list = [a + ', ' + b for a,b in zip(xref, property_list)]
empty_str_in_text = ['\n ', '\n ', '\n ', '\n ']
comment_texts_all = cv_term+comment_text+xref_property_list
for e in empty_str_in_text:
    if e in comment_texts_all:
        comment_texts_all.remove(e)
key_values['Comments'] = ';; '.join([i + ': ' + j for i, j in zip(comment_cat, comment_texts_all)])
Output:
Derived from sampling site: Epstein-Barr virus (EBV);;
Transformant: Peripheral blood ;;
Sequence variation: Hemizygous for FMR1 >200 CGG repeats (PubMed=25776194) ;;
Monoclonal antibody target: UniProtKB, Q5T5X7, Human BEND3
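The swapped first two lines can be reproduced in isolation. The sketch below is only an illustration on a cut-down copy of the sample XML (the names `root`, `combined` and `pairs` are made up for this example). It suggests the likely cause: each document-wide XPath query returns one flat list, so the link between a value and its parent <comment> is lost, and concatenating the lists with the cv-term texts first shifts every later value by one position.

```python
from lxml import etree

# Cut-down copy of the sample XML (first two comments only).
xml = """<comment-list>
<comment category="Derived from sampling site"> Peripheral blood </comment>
<comment category="Transformant">
<cv-term terminology="NCBI-Taxonomy" accession="10376">Epstein-Barr virus (EBV)</cv-term>
</comment>
</comment-list>"""

root = etree.fromstring(xml)

# Each query returns a flat list for the whole document.
categories = root.xpath('./comment/@category')
comment_text = root.xpath('./comment/text()')
cv_term = root.xpath('./comment/cv-term/text()')

# Same concatenation order as in the code above: cv-term texts come first.
# Whitespace-only text nodes are dropped here instead of using a fixed list.
combined = cv_term + [t for t in comment_text if t.strip()]

# zip() pairs purely by position, so the first category gets the cv-term text.
pairs = list(zip(categories, combined))
for category, text in pairs:
    print(f"{category}: {text.strip()}")
```

Pairing by position only works if every list has exactly one entry per <comment>, in the same order, which is not the case as soon as different comments store their value in different child elements.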
Here is a slightly different approach:
xml = '''<comment-list>
<comment category="Derived from sampling site"> Peripheral blood </comment>
<comment category="Transformant">
<cv-term terminology="NCBI-Taxonomy" accession="10376">Epstein-Barr virus (EBV)</cv-term>
</comment>
<comment category="Sequence variation"> Hemizygous for FMR1 >200 CGG repeats (PubMed=25776194)</comment>
<comment category="Monoclonal antibody target">
<xref-list>
<xref database="UniProtKB" category="Sequence databases" accession="Q5T5X7">
<property-list>
<property name="gene/protein designation" value="Human BEND3"/>
</property-list>
<url><![CDATA[https://www.uniprot.org/uniprot/Q5T5X7]]></url>
</xref>
</xref-list>
</comment>
<comment category="Knockout cell">
<method>KO mouse</method>
<xref-list>
<xref database="MGI" category="Organism-specific " accession="MGI:97740">
<property-list>
<property name="gene/protein designation" value="Polb"/>
</property-list>
<url><![CDATA[http://www.informatics.jax.org//MGI:97740]]></url>
</xref>
</xref-list>
</comment>
</comment-list>'''
from lxml import etree as ET

tree = ET.fromstring(xml)
result = ''
for comment in tree.iter('comment'):
    result += f"{comment.get('category')}: "
    cv_term = comment.find('cv-term')
    xref_list = comment.find('xref-list')
    method = comment.find('method')
    if len(comment) == 0:
        # Case 1: no child elements, so use the comment's own text.
        result += comment.text.strip()
    elif cv_term is not None:
        # Case 2: a nested <cv-term> element.
        result += ', '.join([cv_term.get('terminology'), cv_term.get('accession'), cv_term.text])
    elif xref_list is not None and method is None:
        # Case 3: take the first <xref> and its first <property> value.
        result += ', '.join([xref_list.xpath('./xref/@database')[0], xref_list.xpath('./xref/@accession')[0], xref_list.xpath('./xref/property-list/property/@value')[0]])
    elif method is not None:
        # Extra case in the extended sample above: a <method> child.
        result += method.text
    result += '\n'
print(result)
Output:
Derived from sampling site: Peripheral blood
Transformant: NCBI-Taxonomy, 10376, Epstein-Barr virus (EBV)
Sequence variation: Hemizygous for FMR1 >200 CGG repeats (PubMed=25776194)
Monoclonal antibody target: UniProtKB, Q5T5X7, Human BEND3
Knockout cell: KO mouse
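If the result has to end up as one `';; '`-joined string, as in the question's `key_values['Comments']`, the same per-comment idea can be wrapped in a small function. This is a minimal sketch, not a definitive implementation: `describe_comment` is a hypothetical helper name, and joining several <xref> entries with `'; '` is an assumption, since the sample only contains one <xref> per comment.

```python
from lxml import etree


def describe_comment(comment):
    """Return 'category: details' for one <comment> element (hypothetical helper)."""
    category = comment.get('category')
    method = comment.find('method')
    cv_term = comment.find('cv-term')
    xrefs = comment.findall('xref-list/xref')
    if method is not None:
        # Same choice as the loop above: a <method> child wins over <xref-list>.
        details = method.text.strip()
    elif cv_term is not None:
        details = ', '.join([cv_term.get('terminology'),
                             cv_term.get('accession'),
                             cv_term.text.strip()])
    elif xrefs:
        parts = []
        for xref in xrefs:
            # Relative XPath keeps each value tied to its own <xref>.
            values = xref.xpath('./property-list/property/@value')
            parts.append(', '.join([xref.get('database'), xref.get('accession'), *values]))
        details = '; '.join(parts)  # separator for multiple xrefs is an assumption
    else:
        # No child elements: fall back to the comment's own text.
        details = (comment.text or '').strip()
    return f'{category}: {details}'


xml = """<comment-list>
<comment category="Derived from sampling site"> Peripheral blood </comment>
<comment category="Transformant">
<cv-term terminology="NCBI-Taxonomy" accession="10376">Epstein-Barr virus (EBV)</cv-term>
</comment>
<comment category="Monoclonal antibody target">
<xref-list>
<xref database="UniProtKB" category="Sequence databases" accession="Q5T5X7">
<property-list>
<property name="gene/protein designation" value="Human BEND3"/>
</property-list>
</xref>
</xref-list>
</comment>
</comment-list>"""

root = etree.fromstring(xml)
lines = [describe_comment(c) for c in root.iter('comment')]
comments = ';; '.join(lines)
print(comments)
```

Because every value is looked up relative to its own <comment>, the order of the output always follows the order of the comments in the document, whichever case each one falls into.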