Python lxml find text efficiently
Using Python lxml, I want to test whether an XML document contains EXPERIMENT_TYPE and, if it does, extract the corresponding VALUE.
Example:
<EXPERIMENT_SET>
<EXPERIMENT center_name="BCCA" alias="Experiment-pass_2.0">
<TITLE>WGBS (whole genome bisulfite sequencing) analysis of SomeSampleA (library: SomeLibraryA).</TITLE>
<STUDY_REF accession="SomeStudy" refcenter="BCCA"/>
<EXPERIMENT_ATTRIBUTES>
<EXPERIMENT_ATTRIBUTE><TAG>EXPERIMENT_TYPE</TAG><VALUE>DNA Methylation</VALUE></EXPERIMENT_ATTRIBUTE>
<EXPERIMENT_ATTRIBUTE><TAG>EXPERIMENT_ONTOLOGY_URI</TAG><VALUE>http://purl.obolibrary.org/obo/OBI_0001863</VALUE></EXPERIMENT_ATTRIBUTE>
<EXPERIMENT_ATTRIBUTE><TAG>EXPERIMENT_ONTOLOGY_CURIE</TAG><VALUE>obi:0001863</VALUE></EXPERIMENT_ATTRIBUTE>
<EXPERIMENT_ATTRIBUTE><TAG>MOLECULE</TAG><VALUE>genomic DNA</VALUE></EXPERIMENT_ATTRIBUTE>
</EXPERIMENT_ATTRIBUTES>
</EXPERIMENT>
</EXPERIMENT_SET>
Is there a faster way than iterating over all the elements?
all = etree.findall('EXPERIMENT/EXPERIMENT_ATTRIBUTES/EXPERIMENT_ATTRIBUTE/TAG')
for e in all:
    if e.text == 'EXPERIMENT_TYPE':
        print("Found")
This attempt also gets messy once I want to extract the VALUE.
In XPath terms, it seems you simply want to select the VALUE element based on its sibling TAG element, e.g.
/EXPERIMENT_SET/EXPERIMENT/EXPERIMENT_ATTRIBUTES/EXPERIMENT_ATTRIBUTE[TAG = 'EXPERIMENT_TYPE']/VALUE
I think that with Python and lxml people often select the text node instead, e.g.
/EXPERIMENT_SET/EXPERIMENT/EXPERIMENT_ATTRIBUTES/EXPERIMENT_ATTRIBUTE[TAG = 'EXPERIMENT_TYPE']/VALUE/text()
because the xpath function then returns the matches as Python strings (in a list).
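As a minimal sketch of that text-node selection (the XML is abbreviated from the question, and the names texts and experiment_type are only illustrative), the list returned by xpath() can double as the existence test, since an empty list is falsy:

```python
from lxml import etree

# Abbreviated copy of the sample document from the question.
xml_str = """
<EXPERIMENT_SET>
  <EXPERIMENT center_name="BCCA" alias="Experiment-pass_2.0">
    <EXPERIMENT_ATTRIBUTES>
      <EXPERIMENT_ATTRIBUTE><TAG>EXPERIMENT_TYPE</TAG><VALUE>DNA Methylation</VALUE></EXPERIMENT_ATTRIBUTE>
      <EXPERIMENT_ATTRIBUTE><TAG>MOLECULE</TAG><VALUE>genomic DNA</VALUE></EXPERIMENT_ATTRIBUTE>
    </EXPERIMENT_ATTRIBUTES>
  </EXPERIMENT>
</EXPERIMENT_SET>
"""
root = etree.fromstring(xml_str)

# Selecting text() makes xpath() return a list of strings, not elements.
texts = root.xpath(
    "/EXPERIMENT_SET/EXPERIMENT/EXPERIMENT_ATTRIBUTES"
    "/EXPERIMENT_ATTRIBUTE[TAG = 'EXPERIMENT_TYPE']/VALUE/text()"
)

# An empty list means the document has no EXPERIMENT_TYPE attribute.
experiment_type = str(texts[0]) if texts else None
print(experiment_type)
```

If several EXPERIMENT elements can occur in one file, texts holds one string per match, so taking only texts[0] is an assumption made here for brevity.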
Using findall is a natural way to do this. I suggest the following code to find the value:
from lxml import etree

root = etree.parse('toto.xml').getroot()
all = root.findall('EXPERIMENT/EXPERIMENT_ATTRIBUTES/EXPERIMENT_ATTRIBUTE/TAG')
for e in all:
    if e.text == 'EXPERIMENT_TYPE':
        # VALUE is a sibling of TAG, so look it up via the shared parent
        v = e.getparent().find('VALUE')
        if v is not None:
            print(f'Found val="{v.text}"')
This outputs:
Found val="DNA Methylation"
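For comparison, here is a hedged sketch of the same lookup without the explicit loop. It relies on the child-text predicate ([TAG='...']) that lxml's find-style path syntax accepts, and on findtext returning None when nothing matches. The XML is abbreviated from the question; the names value and missing are illustrative.

```python
from lxml import etree

# Abbreviated copy of the sample document from the question.
xml_str = """
<EXPERIMENT_SET>
  <EXPERIMENT center_name="BCCA" alias="Experiment-pass_2.0">
    <EXPERIMENT_ATTRIBUTES>
      <EXPERIMENT_ATTRIBUTE><TAG>EXPERIMENT_TYPE</TAG><VALUE>DNA Methylation</VALUE></EXPERIMENT_ATTRIBUTE>
      <EXPERIMENT_ATTRIBUTE><TAG>MOLECULE</TAG><VALUE>genomic DNA</VALUE></EXPERIMENT_ATTRIBUTE>
    </EXPERIMENT_ATTRIBUTES>
  </EXPERIMENT>
</EXPERIMENT_SET>
"""
root = etree.fromstring(xml_str)

# findtext returns the text of the first matching VALUE, or None if there is none.
value = root.findtext(".//EXPERIMENT_ATTRIBUTE[TAG='EXPERIMENT_TYPE']/VALUE")
missing = root.findtext(".//EXPERIMENT_ATTRIBUTE[TAG='NO_SUCH_TAG']/VALUE")

if value is not None:
    print(f'Found val="{value}"')
```

Only the first match is returned this way; findall with the same path would give every matching VALUE element.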
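You are better off doing this with XPath, which should be very fast. My suggestion (tested and working) is below. It returns a (possibly empty) list of VALUE elements, from which you can then extract the text.
PS: do not use "special" words such as all as variable names. all is a Python built-in function, so shadowing it is bad practice and can lead to unexpected bugs.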
import lxml.etree as ET
from lxml.etree import Element
from typing import List
xml_str = """
<EXPERIMENT_SET>
<EXPERIMENT center_name="BCCA" alias="Experiment-pass_2.0">
<TITLE>WGBS (whole genome bisulfite sequencing) analysis of SomeSampleA (library: SomeLibraryA).</TITLE>
<STUDY_REF accession="SomeStudy" refcenter="BCCA"/>
<EXPERIMENT_ATTRIBUTES>
<EXPERIMENT_ATTRIBUTE><TAG>EXPERIMENT_TYPE</TAG><VALUE>DNA Methylation</VALUE></EXPERIMENT_ATTRIBUTE>
<EXPERIMENT_ATTRIBUTE><TAG>EXPERIMENT_ONTOLOGY_URI</TAG><VALUE>http://purl.obolibrary.org/obo/OBI_0001863</VALUE></EXPERIMENT_ATTRIBUTE>
<EXPERIMENT_ATTRIBUTE><TAG>EXPERIMENT_ONTOLOGY_CURIE</TAG><VALUE>obi:0001863</VALUE></EXPERIMENT_ATTRIBUTE>
<EXPERIMENT_ATTRIBUTE><TAG>MOLECULE</TAG><VALUE>genomic DNA</VALUE></EXPERIMENT_ATTRIBUTE>
</EXPERIMENT_ATTRIBUTES>
</EXPERIMENT>
</EXPERIMENT_SET>
"""
tree = ET.ElementTree(ET.fromstring(xml_str))
# xpath() returns a list, which is empty when nothing matches
vals: List[Element] = tree.xpath(".//EXPERIMENT_ATTRIBUTE/TAG[text()='EXPERIMENT_TYPE']/following-sibling::VALUE")
if vals:
    print(vals[0].text)
# DNA Methylation
Michael Kay offered another XPath expression below, which is the same as the one in Martin Honnen's answer:
.//EXPERIMENT_ATTRIBUTE[TAG='EXPERIMENT_TYPE']/VALUE
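If the same lookup runs against many documents or for several tag names, one possible refinement is to compile the expression once with etree.XPath and pass the tag name as an XPath variable. This is only a sketch: the helper get_attribute and the variable name $tag_name are hypothetical, and the XML is abbreviated from the question.

```python
from lxml import etree

# Abbreviated copy of the sample document from the question.
xml_str = """
<EXPERIMENT_SET>
  <EXPERIMENT center_name="BCCA" alias="Experiment-pass_2.0">
    <EXPERIMENT_ATTRIBUTES>
      <EXPERIMENT_ATTRIBUTE><TAG>EXPERIMENT_TYPE</TAG><VALUE>DNA Methylation</VALUE></EXPERIMENT_ATTRIBUTE>
      <EXPERIMENT_ATTRIBUTE><TAG>MOLECULE</TAG><VALUE>genomic DNA</VALUE></EXPERIMENT_ATTRIBUTE>
    </EXPERIMENT_ATTRIBUTES>
  </EXPERIMENT>
</EXPERIMENT_SET>
"""
root = etree.fromstring(xml_str)

# Compile once; $tag_name is bound on each call, so no string formatting is needed.
find_values = etree.XPath(".//EXPERIMENT_ATTRIBUTE[TAG=$tag_name]/VALUE")


def get_attribute(node, tag_name):
    """Hypothetical helper: text of the first matching VALUE, or None."""
    vals = find_values(node, tag_name=tag_name)
    return vals[0].text if vals else None


print(get_attribute(root, "EXPERIMENT_TYPE"))
print(get_attribute(root, "NO_SUCH_TAG"))
```

Binding the name as a variable also avoids quoting problems if a tag name ever contains an apostrophe.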