Applying ElementTree to parse a complex XML structure
I'm having trouble parsing the XML file below. Here is what I tried:
<?xml version="1.0" encoding="UTF-8" standalone="no" ?>
<corpus name="P4P" version="1.0" lng="en" xmlns="http://clic.ub.edu/mbertran/formats/paraphrase-corpus"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://clic.ub.edu/mbertran/
formats/paraphrase-corpus http://clic.ub.edu/mbertran/formats/paraphrase-corpus.xsd">
<snippets>
<snippet id="16488" source_description="type:plagiarism;plagiarism_reference:00061;
offset:47727;length:182;source:P4P;wd_count:37">
All art is imitation of nature.
</snippet>
</snippets>
</corpus>
import xml.etree.ElementTree
#root=xml.etree.ElementTree.parse("C:\Users\P4P_corpus\P4P_corpus_v1.xml").getroot()
source=root.findall('snippets/snippet')
for details in source.findall:
    print details.get('source_description')
    print details.findtext
My output is empty.
The output I want:
"type:plagiarism;plagiarism_reference:00061;
offset:47727;length:182;source:P4P;wd_count:37"
and All art is imitation of nature.
Any suggestions would be much appreciated.
You need to prefix the element names with the XML namespace. If you print root after parsing, you get:
<Element '{http://clic.ub.edu/mbertran/formats/paraphrase-corpus}corpus' at 0x7ff7891f6390>
^ this part here is the full name ^
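So to iterate over the 'snippet' elements, you first find the 'snippets' element and then the 'snippet' elements inside it, using the fully qualified name in both cases: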
for snippets in root.findall('{http://clic.ub.edu/mbertran/formats/paraphrase-corpus}snippets'):
    for s in snippets.findall('{http://clic.ub.edu/mbertran/formats/paraphrase-corpus}snippet'):
        # print(...) with a single argument behaves the same on Python 2 and 3
        print(s.get('source_description'))
You can read more about handling namespaces at https://docs.python.org/2/library/xml.etree.elementtree.html#parsing-xml-with-namespaces
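As a shorter alternative to spelling out the full namespace in every tag, the linked documentation describes passing a prefix-to-URI mapping to findall. Below is a minimal sketch of that approach. It assumes a trimmed, inlined copy of the XML from the question (the attribute is put on one line and the xsi attributes are dropped); the prefix "p4p" and the helper name read_snippets are made up for illustration and are not part of the original post.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the document from the question, inlined so the sketch
# is self-contained (assumption: the real file has the same structure).
XML = """<corpus name="P4P" version="1.0" lng="en" xmlns="http://clic.ub.edu/mbertran/formats/paraphrase-corpus">
<snippets>
<snippet id="16488" source_description="type:plagiarism;plagiarism_reference:00061;offset:47727;length:182;source:P4P;wd_count:37">
All art is imitation of nature.
</snippet>
</snippets>
</corpus>"""

# "p4p" is an arbitrary prefix chosen here; only the URI has to match the file.
NS = {"p4p": "http://clic.ub.edu/mbertran/formats/paraphrase-corpus"}


def read_snippets(root):
    """Return (source_description, text) pairs for every snippet element."""
    results = []
    for s in root.findall("p4p:snippets/p4p:snippet", NS):
        # s.text holds the element's text; strip the surrounding newlines.
        results.append((s.get("source_description"), (s.text or "").strip()))
    return results


root = ET.fromstring(XML)
for description, text in read_snippets(root):
    print(description)
    print(text)
```

With the real file, root would come from xml.etree.ElementTree.parse(...).getroot() as in the question. The unqualified path 'snippets/snippet' matches nothing because every element in the document lives in the default namespace, which is why the original loop printed nothing.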