Extracting comments from an XML file in Python

Question

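I want to extract the comment part of an XML file. The information I am after sits inside the Tag element, and within that inside the Text element, i.e. "EXAMPLE".

The XML file is structured as shown below.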

<Boxes>

  <Box Id="3" ZIndex="13">
      <Shape>Rectangle</Shape>
      <Brush Id="0" />
      <Pen>
        <Color>#FF000000</Color>

      </Pen>
      <Tag>&lt;?xml version="1.0"?&gt;
&lt;PFDComment xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema"&gt;
  &lt;Text&gt;**EXAMPLE** &lt;/Text&gt;

&lt;/PFDComment&gt;</Tag>
  </Box>

</Boxes>

I tried the code below, but could not get the information I want.

def read_cooments(xml):
    tree = lxml.etree.parse(xml)

    Comments= {}
    for comment in tree.xpath("//Boxes/Box"):
    #                                
        get_id = comment.attrib['Id']
        Comments[get_id] = []
        for group in comment.xpath(".//Tag"):
        #                        
            Comments[get_id].append(group.text)

    df_name1 = pd.DataFrame(dict([(k,pd.Series(v)) for k,v in Comments.items()]))

Can anyone help me extract the comments from the XML file shown above? Thanks for your help!

Answer 1

Use the code given below:

def read_comments(xml):
    tree = etree.parse(xml)
    rows= []
    for box in tree.xpath('Box'):
        id = box.attrib['Id']
        tagTxt = box.findtext('Tag')
        if tagTxt is None:
            continue
        txtNode = etree.XML(tagTxt).find('Text')
        if txtNode is None:
            continue
        rows.append([id, txtNode.text.strip()])
    return pd.DataFrame(rows, columns=['id', 'Comment'])

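Note that if you create the DataFrame inside the function, it is a variable local to that function and is not visible from outside. A better and more readable approach (as I did) is for the function to return the DataFrame.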

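The function also contains continue in two places, to guard against possible "error cases": when a Box element has no Tag child element, or when the Tag contains no Text child element.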
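The two guarded cases can be seen in isolation with a small sketch. The inline `SAMPLE` string, the `results` dictionary and the status labels are made up for illustration and are not part of the original file; only the third Box carries a comment.

```python
from lxml import etree

# Hypothetical sample: Box 1 has no Tag, Box 2 has a Tag without Text,
# Box 3 has a complete comment.
SAMPLE = """<Boxes>
  <Box Id="1"><Shape>Rectangle</Shape></Box>
  <Box Id="2"><Tag>&lt;PFDComment/&gt;</Tag></Box>
  <Box Id="3"><Tag>&lt;PFDComment&gt;&lt;Text&gt;ok&lt;/Text&gt;&lt;/PFDComment&gt;</Tag></Box>
</Boxes>"""

root = etree.fromstring(SAMPLE)
results = {}
for box in root.findall('Box'):
    box_id = box.get('Id')
    tag_text = box.findtext('Tag')          # None when there is no Tag child
    if tag_text is None:
        results[box_id] = 'no Tag'
        continue
    text_node = etree.fromstring(tag_text).find('Text')  # None when there is no Text child
    if text_node is None:
        results[box_id] = 'no Text'
        continue
    results[box_id] = text_node.text

print(results)
```

A Tag that is present but empty is a different situation: findtext then returns an empty string rather than None, and parsing an empty string would raise an error. Neither this sketch nor the function above handles that case.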

I also noticed that there is no need to replace &lt; or &gt; with < or > in your own code, because lxml does this by itself.
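As a small illustration of that point, the sketch below parses a cut-down version of the file from an inline string. The `SAMPLE` constant and the variable names are assumptions made for this example. The text of Tag comes back with real angle brackets, so it can be passed straight to a second parse.

```python
from lxml import etree

# Hypothetical cut-down sample; the inner document is escaped, as in the question.
SAMPLE = (
    '<Boxes><Box Id="3">'
    '<Tag>&lt;PFDComment&gt;&lt;Text&gt;**EXAMPLE** &lt;/Text&gt;&lt;/PFDComment&gt;</Tag>'
    '</Box></Boxes>'
)

root = etree.fromstring(SAMPLE)

# The parser has already turned &lt; and &gt; back into < and >.
tag_text = root.findtext('Box/Tag')
print(tag_text)

# The unescaped string is itself XML, so it can be parsed again.
inner = etree.fromstring(tag_text)
comment = inner.findtext('Text').strip()
print(comment)
```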

Edit

My test was as follows. I started with the imports:

import pandas as pd
from lxml import etree

The file I used contains:

<Boxes>
  <Box Id="3" ZIndex="13">
    <Shape>Rectangle</Shape>
    <Brush Id="0" />
    <Pen>
      <Color>#FF000000</Color>
    </Pen>
    <Tag>&lt;?xml version="1.0"?&gt;
&lt;PFDComment xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema"&gt;
  &lt;Text&gt;**EXAMPLE** &lt;/Text&gt;
&lt;/PFDComment&gt;</Tag>
  </Box>
</Boxes>

I called the function above:

df_name1 = read_comments('Boxes.xml')

When I print df_name1, I get:

  id      Comment
0  3  **EXAMPLE**

If something goes wrong, use the "extended" version of the function above, which adds test printouts:

def read_comments(xml):
    tree = etree.parse(xml)
    rows= []
    for box in tree.xpath('Box'):
        id = box.attrib['Id']
        tagTxt = box.findtext('Tag')
        if tagTxt is None:
            print('No Tag element')
            continue
        txtNode = etree.XML(tagTxt).find('Text')
        if txtNode is None:
            print('No Text element')
            continue
        txt = txtNode.text.strip()
        print(f'{id}: {txt}')
        rows.append([id, txt])
    return pd.DataFrame(rows, columns=['id', 'Comment'])

Then look at the printouts.
