How to modify the text of nested elements in an XML file using Python?

Question

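I am currently working on a corpus/dataset. It is in XML format, as shown in the picture below. I am facing a problem: I want to access all the 'ne' elements one by one, and then access the text of the 'W' elements inside each 'ne' element. Then I want to concatenate the symbols 'SDi' and 'EDi' with the text of these 'W' elements, where 'i' can be any positive integer starting from 1. For 'SDi', I only need the text of the first 'W' element inside the 'ne' element. For 'EDi', I only need the text of the last 'W' element inside the 'ne' element. Currently I get no output after running the code. I think this is because the 'W' element is never accessed. I also think it is not accessed because it is a grandchild of the 'ne' element, so it cannot be reached directly, though it might be reachable through its parent node.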

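Note 1: The number and names of the child elements inside the 'ne' elements vary.

Note 2: Only the required parts are explained here. You may find other details in the code/picture, but please ignore them.

I am using Spyder (Python 3.6). Any help would be appreciated.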

A picture of the XML file I am working on is shown below:

Text version of the XML file: Click here

Sample/expected output image (below):

The code I have written so far:

for i in range(len(List_of_root_nodes)):
    true_false = True
    current = List_of_root_nodes[i]
    start_ID = current.PDante_ID
    #print('start:', start_ID) # For Testing
    end_ID = None
    number = str(i+1)  # This number will serve as i used with SD and ED that is (SDi and EDi)
    discourse_starting_symbol = "SD" + number
    discourse_ending_symbol = "ED" + number
    while true_false:
        if current.right_child is None:
            end_ID = current.PDante_ID
            #print('end:', end_ID) # For Testing
            true_false = False
        else:
            current = current.right_child
    # Finding 'ne' element with id='start_ID'
    ne_text = None
    ne_id = None
    for ne in myroot.iter('ne'):
        ne_id = ne.get('id')
        # If ne_id matches with start_ID means the place where SDi is to be placed is found
        if ne_id == start_ID:
            for w in ne.iter('W'):
                ne_text = str(w.text)
                boundary_and_text = " " + str(discourse_starting_symbol) + " " + ne_text
                w.text = boundary_and_text
                break
        # If ne_id matches with end_ID means the place where EDi is to be placed is found
        # Some changes Required here: Here the 'EDi' will need to be placed after the last 'W' element.
        # So last 'W' element needs to be accessed
        if ne_id == end_ID:
            for w in ne.iter('W'):
                ne_text = str(w.text)
                boundary_and_text = ne_text + " " + str(discourse_ending_symbol) + " "
                w.text = boundary_and_text
                break

Answer 1

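Whenever you need to modify XML according to various nuanced requirements, consider XSLT, a special-purpose language designed to transform XML files. You can run XSLT 1.0 scripts with Python's third-party module lxml (not the built-in etree).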

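Specifically, call the identity transform to copy the XML as is, then run two templates: one adds SDI to the text of the first <W>, the other adds EDI to the text of the last <W>. The solution works whether there are 10 or 10,000 <W> nodes, regardless of how deeply they are nested.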

To demonstrate with sample data of Stack Overflow's top Python and XSLT users, see the online demo, where SDI and EDI are added to the first and last <user> nodes:

XSLT (save as an .xsl file, a special .xml file, to be loaded in Python)

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output indent="yes"/>
  <xsl:strip-space elements="*"/>

  <!-- IDENTITY TRANSFORM -->    
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- EDIT FIRST W NODE -->    
  <xsl:template match="W[count(preceding::W)=0]">
    <xsl:copy>
      <xsl:copy-of select="@*"/>
      <xsl:value-of select="concat('SDI ', text())"/>
    </xsl:copy>
  </xsl:template>

  <!-- EDIT LAST W NODE -->    
  <xsl:template match="W[count(preceding::W)+1 = count(//W)]">
    <xsl:copy>
      <xsl:copy-of select="@*"/>
      <xsl:value-of select="concat(text(), ' EDI')"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>

Python (no loops or if/else logic)

import lxml.etree as et

doc = et.parse('/path/to/Input.xml')
xsl = et.parse('/path/to/Script.xsl')

# CONFIGURE TRANSFORMER
transform = et.XSLT(xsl)    

# TRANSFORM SOURCE DOC
result = transform(doc)

# OUTPUT TO CONSOLE
print(result)

# SAVE TO FILE
with open('Output.xml', 'wb') as f:
    f.write(result)
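
For a quick check without any files on disk, the same idea can be tried with the stylesheet and a small document held in memory. This is only a sketch: the sample XML below is made up for illustration, the stylesheet is a trimmed copy of the one above, and the end marker is appended after the text of the last <W>, which is what the question asks for. It assumes lxml is installed.

```python
import lxml.etree as et  # third-party: pip install lxml

# Hypothetical sample document; element names and text are illustrative only.
XML = b"""<root>
  <ne id="1"><NP><W>alpha</W><W>beta</W></NP></ne>
  <ne id="2"><W>gamma</W></ne>
</root>"""

XSL = b"""<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- identity transform: copy every node and attribute unchanged -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- first W in the whole document: prepend the start marker -->
  <xsl:template match="W[count(preceding::W)=0]">
    <xsl:copy>
      <xsl:copy-of select="@*"/>
      <xsl:value-of select="concat('SDI ', text())"/>
    </xsl:copy>
  </xsl:template>

  <!-- last W in the whole document: append the end marker -->
  <xsl:template match="W[count(preceding::W)+1 = count(//W)]">
    <xsl:copy>
      <xsl:copy-of select="@*"/>
      <xsl:value-of select="concat(text(), ' EDI')"/>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>"""

# Build the transformer from the in-memory stylesheet and apply it.
transform = et.XSLT(et.fromstring(XSL))
result = transform(et.fromstring(XML))

# Collect the text of every W in document order to inspect the effect.
texts = [w.text for w in result.getroot().iter('W')]
print(texts)  # ['SDI alpha', 'beta', 'gamma EDI']
```

Note that these two templates look at the first and last <W> of the whole document, not of each <ne>; a per-<ne> variant with numbered markers would need different match patterns.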

Answer 2

Something like this (a.xml is the XML you uploaded):

Note that the code does not use any external libraries.

import xml.etree.ElementTree as ET

SD = 'SD'
ED = 'ED'

root = ET.parse('a.xml')

counter = 1

for ne in root.findall('.//ne'):
    w_lst = ne.findall('.//W')
    if w_lst:
        w_lst[0].text = '{}{} {}'.format(SD, counter, w_lst[0].text)
        if len(w_lst) > 1:
            w_lst[-1].text = '{} {}{}'.format(w_lst[-1].text, ED, counter)
        counter += 1
ET.dump(root)
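
Since the uploaded file is not reproduced here, the following self-contained variant may help to see the behaviour on a tiny document. It is a sketch under assumptions: the sample XML and the helper name mark_boundaries are invented for illustration, a <ne> holding a single <W> receives both markers, and a <W> without text is treated as empty. It also shows that iter() reaches <W> elements at any depth below <ne>, so grandchildren are not a problem.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the real corpus has more elements and attributes.
SAMPLE = """<doc>
  <ne id="a"><NP><W>one</W><PP><W>two</W><W>three</W></PP></NP></ne>
  <ne id="b"><W>solo</W></ne>
  <ne id="c"><other/></ne>
</doc>"""


def mark_boundaries(root, start="SD", end="ED"):
    """Prefix the first W and suffix the last W of every ne that has a W.

    Returns the number of ne elements that were marked.
    """
    counter = 0
    for ne in root.iter("ne"):
        # iter() walks all descendants, not only direct children.
        words = list(ne.iter("W"))
        if not words:
            continue  # nothing to mark in this ne
        counter += 1
        first, last = words[0], words[-1]
        first.text = "{}{} {}".format(start, counter, first.text or "")
        # last.text is read after first was updated, so a lone W gets both.
        last.text = "{} {}{}".format(last.text or "", end, counter)
    return counter


root = ET.fromstring(SAMPLE)
marked = mark_boundaries(root)
texts = [w.text for w in root.iter("W")]
print(marked, texts)
# 2 ['SD1 one', 'two', 'three ED1', 'SD2 solo ED2']
```

To save the result instead of printing it, ET.ElementTree(root).write(path, encoding='utf-8', xml_declaration=True) can be used, with path being whatever output file name you choose.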
