Using lxml in Python to replace "RNA" with <mark>RNA</mark> in an input XML file

Using lxml in Python, I need to replace "RNA" with <mark>RNA</mark> in an input XML file. My code is below.

My input XML file is:

<?xml version='1.0' encoding='UTF-8'?>
<try>
something somethingRNA and RNA in RNA.
</try> 

My Python Code:

import lxml.etree as ET
import openpyxl
import re

url = 'output_15012015_test.xml'

tree = ET.parse(url)

lncrna = "RNA"
abstract = tree.xpath('//try')

if abstract:
    string = abstract[0].text
    # Both \b markers need raw strings; a plain '\b' is a backspace character
    anotherString = re.sub(r'\b' + lncrna.lower() + r'\b', '<mark>' + lncrna + '</mark>', string.lower())
    abstract[0].text = anotherString
    print(abstract[0].text)
tree.write('FalseRoller.xml', encoding='UTF-8', pretty_print=True)

Output

I get the following replacement text instead of <mark>RNA</mark>:

 &lt;mark&gt;RNA&lt;/mark&gt;

I think it has to do with tree.write() method. Also I'm new to Python and the community. Please help me with this.

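You set XML markup in the element's .text, so when the XML is written it is treated as text rather than markup, and the special characters are escaped as entities (< becomes &lt; and > becomes &gt;).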
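A minimal sketch of the difference, using small in-memory elements rather than the file from the question (the element contents and variable names here are made up for illustration):

```python
import lxml.etree as ET

# Assigning markup to .text: lxml treats it as character data and escapes it.
root = ET.Element('try')
root.text = 'in <mark>RNA</mark>.'
escaped = ET.tostring(root, encoding='unicode')

# Building a real child element instead: .text holds the text before the child,
# and the child's .tail holds the text that follows it.
root2 = ET.Element('try')
root2.text = 'in '
mark = ET.SubElement(root2, 'mark')
mark.text = 'RNA'
mark.tail = '.'
as_markup = ET.tostring(root2, encoding='unicode')

print(escaped)
print(as_markup)
```

The first serialization contains the escaped entities from the question; the second contains an actual <mark> element.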

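What you want to do is:

  • split the .text into three parts: before the new tag, inside the new tag, and after the new tag
  • add the new tag and set its text and tail

See the code: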

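import re
import lxml.etree as ET

url = 'output_15012015_test.xml'
tree = ET.parse(url)

lncrna = "RNA"
abstract = tree.xpath('//try')

# The capturing group keeps the matches in the result: even indices hold the
# text between matches, odd indices hold the matched words themselves.
aList = re.split(r'(\b'+lncrna+r'\b)', abstract[0].text, flags=re.IGNORECASE)

abstract[0].text = aList[0]
for i in range(1, len(aList), 2):
    # SubElement already appends the new <mark> as the last child,
    # so no separate insert() call is needed
    anElement = ET.SubElement(abstract[0], 'mark')
    anElement.text = aList[i]
    anElement.tail = aList[i+1]

print(abstract[0].text)
tree.write('FalseRoller.xml', encoding='UTF-8', pretty_print=True)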
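
For checking the result without touching any files, the same approach can be sketched as a small helper that works on an in-memory element. The name mark_word and its tag parameter are hypothetical, not part of lxml; the sketch also assumes only the element's own .text needs marking (not the tails of existing children) and uses re.escape in case the search word contains regex metacharacters.

```python
import re
import lxml.etree as ET

def mark_word(element, word, tag='mark'):
    """Wrap whole-word, case-insensitive matches of `word` in element.text."""
    parts = re.split(r'(\b' + re.escape(word) + r'\b)',
                     element.text or '', flags=re.IGNORECASE)
    element.text = parts[0]
    # Insert at the front, in order, so the new elements stay ahead of
    # any children the element already has.
    for n, i in enumerate(range(1, len(parts), 2)):
        child = ET.Element(tag)
        child.text = parts[i]      # the matched word
        child.tail = parts[i + 1]  # the text up to the next match
        element.insert(n, child)
    return element

root = ET.fromstring('<try>something somethingRNA and RNA in rna.</try>')
mark_word(root, 'RNA')
result = ET.tostring(root, encoding='unicode')
print(result)
```

Note that "somethingRNA" is left alone, because \b requires a word boundary before "RNA".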