Extract text from value tag Etree XML python

Question

I want to extract the text from the value tags. My XML snippet and my attempt are below:

<datas>
  <data>
    <column datatype='string' name='[Sub-Category (group)]' role='dimension' type='nominal'>
      <calculation class='categorical-bin' column='[Product Sub-Category]' new-bin='false'>
        <bin value='&quot;Envelopes&quot;'>
          <value>&quot;Envelopes&quot;</value>
          <value>&quot;Labels&quot;</value>
          <value>&quot;Pens &amp; Art Supplies&quot;</value>
          <value>&quot;Rubber Bands&quot;</value>
          <value>&quot;Scissors, Rulers and Trimmers&quot;</value>
        </bin>
      </calculation>
   </column>      
</data>
</datas>

My attempt:

root = 'myxmlfile.xml'
valuelist = []
for i in root.findall('./datas/data/column/calculation/bin')
    val  = i.find('value')
    if val:
       for j in val:
           valuelist.append(j.text)

I am not getting the correct result.

Answer 1

This may help:

# -*- coding: utf-8 -*-
s = """<datas>
  <data>
<column datatype='string' name='[Sub-Category (group)]' role='dimension' type='nominal'>
              <calculation class='categorical-bin' column='[Product Sub-Category]' new-bin='false'>
                <bin value='&quot;Envelopes&quot;'>
                  <value>&quot;Envelopes&quot;</value>
                  <value>&quot;Labels&quot;</value>
                  <value>&quot;Pens &amp; Art Supplies&quot;</value>
                  <value>&quot;Rubber Bands&quot;</value>
                  <value>&quot;Scissors, Rulers and Trimmers&quot;</value>
                </bin>
              </calculation>
    </column>
 </data>
</datas>"""

import xml.etree.ElementTree as et
tree = et.fromstring(s)
for i in tree.findall('.//data/column/calculation/bin'):
    for j in i.findall('value'):
        print(j.text)

Output:

"Envelopes"
"Labels"
"Pens & Art Supplies"
"Rubber Bands"
"Scissors, Rulers and Trimmers"

Answer 2

Try this:

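import xml.etree.ElementTree as ET

# Parse the file and get the root element; adjust the path to your file.
doc = ET.parse('/your/path_to_file/data.xml').getroot()
valuelist = []
for i in doc.findall('.//bin'):       # every <bin> at any depth
    for v in i.findall('value'):      # every <value> directly under it
        valuelist.append(v.text)
print(valuelist)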

Output:

['"Envelopes"', '"Labels"', '"Pens & Art Supplies"', '"Rubber Bands"', '"Scissors, Rulers and Trimmers"']

Answer 3

Rakesh's answer is good; I just thought I would add some explanation of why your code did not work.

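First, you need to turn the XML into an ElementTree. This is basically just a Python object with a tree-like structure of elements and sub-elements that corresponds to your XML, but which you can work with in Python.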

If your XML is in a file (rather than just a string in your code), you can do:

tree = ET.parse('myxmlfile.xml')

The root is then the outermost element of that tree, which you need to grab in order to work your way around the tree and find elements:

root = tree.getroot()

(If you do ET.fromstring(s), the return value is the root element itself, so you do not need the getroot step.)

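In your example, root is the datas element, which is one of your problems: your path should not include 'datas', because that is already where you are starting from.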
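To make the point about the root element concrete, here is a minimal sketch using a shortened copy of the XML from the question. The xml_text string and the variable names are only for illustration. It shows that the object returned by ET.fromstring() is already the datas element, so a path that starts with 'datas' matches nothing.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the question's XML, for illustration only.
xml_text = """<datas>
  <data>
    <column name='[Sub-Category (group)]'>
      <calculation class='categorical-bin'>
        <bin value='&quot;Envelopes&quot;'>
          <value>&quot;Envelopes&quot;</value>
          <value>&quot;Labels&quot;</value>
        </bin>
      </calculation>
    </column>
  </data>
</datas>"""

root = ET.fromstring(xml_text)  # fromstring() returns the root Element directly
print(root.tag)                 # datas

# Paths are relative to root, which is already <datas>, so this looks
# for a <datas> child inside <datas> and finds nothing.
with_datas = root.findall('./datas/data/column/calculation/bin')
print(len(with_datas))          # 0

# Starting the path at <data> matches the single <bin>.
without_datas = root.findall('./data/column/calculation/bin')
print(len(without_datas))       # 1
```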

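val = i.find('value') returns only the first value element, not a list of all the value elements as you want. So when you then do for j in val, Python is actually looking for child elements of that value element (there are none), so there is nothing to append to valuelist. You need to use findall() here. If you combine it with a for loop, you do not need the if val check, because the for loop simply will not run if findall() returns an empty list.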
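The difference between find() and findall() can be seen in a small sketch like the one below. The inline bin element is a cut-down stand-in for the one in the question, and the variable names are made up for the example. As a side note, if you do need to check the result of find(), comparing against None (if val is not None) is the safer test, since an element without children has historically evaluated as false.

```python
import xml.etree.ElementTree as ET

# Cut-down stand-in for the <bin> element in the question.
bin_elem = ET.fromstring(
    "<bin>"
    "<value>Envelopes</value>"
    "<value>Labels</value>"
    "</bin>"
)

first = bin_elem.find('value')   # a single Element (None if nothing matches)
print(first.text)                # Envelopes
print(list(first))               # [] because <value> has no child elements

all_values = bin_elem.findall('value')   # a list of every matching child
texts = [v.text for v in all_values]
print(texts)                     # ['Envelopes', 'Labels']

# findall() returns an empty list when nothing matches, so a for loop
# over it simply does not run and no extra check is needed.
missing = bin_elem.findall('no-such-tag')
print(missing)                   # []
```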

Putting all of this together:

import xml.etree.ElementTree as ET

tree = ET.parse('myxmlfile.xml')  # change to wherever your file is located
root = tree.getroot()

binlist = []
for i in root.findall('./data/column/calculation/bin'):
    valuelist = []
    for j in i.findall('value'):
        valuelist.append(j.text)
    binlist.append(valuelist)

binlist is a list in which each item is the list of values for one bin.
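To illustrate the shape of binlist, here is a small sketch with two bins. The second bin and its value are invented for the example, since the question's XML contains only one. It builds the same list of lists with a nested list comprehension instead of explicit loops.

```python
import xml.etree.ElementTree as ET

# The second <bin> is invented for illustration; the question has only one.
xml_text = """<datas>
  <data>
    <column>
      <calculation>
        <bin value='A'>
          <value>Envelopes</value>
          <value>Labels</value>
        </bin>
        <bin value='B'>
          <value>Paper</value>
        </bin>
      </calculation>
    </column>
  </data>
</datas>"""

root = ET.fromstring(xml_text)

# One inner list per <bin>, equivalent to the nested for loops.
binlist = [
    [value.text for value in bin_elem.findall('value')]
    for bin_elem in root.findall('./data/column/calculation/bin')
]
print(binlist)  # [['Envelopes', 'Labels'], ['Paper']]
```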

If you only have one bin, you can simplify the second half of the code:

import xml.etree.ElementTree as ET

tree = ET.parse('myxmlfile.xml')  # change to wherever your file is located
root = tree.getroot()

bin = root.find('./data/column/calculation/bin')
valuelist = []
for j in bin.findall('value'):
    valuelist.append(j.text)

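Note that I import ElementTree as ET rather than et (this seems to be the convention). This also assumes that datas is the first element of the XML. If the snippet you provided is nested inside a larger XML file, you will first need to reach the bin element with something like: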

bin = root.find('<path to bin element>')
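As one possible way to fill in that path, the sketch below uses './/bin', which searches all descendants of the root at any depth. The workbook and datasources wrapper elements are hypothetical; your real file may be nested differently.

```python
import xml.etree.ElementTree as ET

# <workbook> and <datasources> are hypothetical wrapper elements.
xml_text = """<workbook>
  <datasources>
    <datas>
      <data>
        <column>
          <calculation>
            <bin value='A'>
              <value>Envelopes</value>
              <value>Labels</value>
            </bin>
          </calculation>
        </column>
      </data>
    </datas>
  </datasources>
</workbook>"""

root = ET.fromstring(xml_text)

# './/bin' finds the first <bin> anywhere below root.
bin_elem = root.find('.//bin')

valuelist = []
if bin_elem is not None:   # find() returns None when nothing matches
    valuelist = [v.text for v in bin_elem.findall('value')]
print(valuelist)           # ['Envelopes', 'Labels']
```

If the file contains several unrelated bin elements, a more specific path is the safer choice, because './/bin' matches the first one it encounters.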

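These references may be helpful:

I found this very useful as an introduction when I was learning to use ElementTree: http://effbot.org/zone/element.htm
This one is more detailed and also covers iter() and iterparse(): https://eli.thegreenplace.net/2012/03/15/processing-xml-in-python-with-elementtree
The official documentation is also good: https://docs.python.org/3.6/library/xml.etree.elementtree.html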
