Python: parse XML, remove elements by certain attributes, and replace text within attributes

I have the following XML file:

<tv>
    <programme channel="BBC Red Button 1" start="20180422123000 +0000" stop="20180422125500 +0000">
        <title lang="en">Live Snooker: The World Championship: Day Two - 2018</title>
        <desc lang="en">Coverage of day two at the Crucible Theatre in Sheffield</desc>
        <category lang="en">Sport</category>
        <icon src="http://images.radiotimes.com/remote/static.radiotimes.com.edgesuite.net/pa/70/26/webANXsnookerlivebbc.jpg?quality=60&amp;mode=crop&amp;width=130&amp;height=100&amp;404=tv" />
    </programme>
    <programme channel="BBC Red Button 1" start="20180422125500 +0000" stop="20180422150000 +0000">
        <title lang="en">Live UEFA Women's Champions League</title>
        <desc lang="en">Manchester City v Lyon (Kick-off 1.00pm)</desc>
        <category lang="en">Sport</category>
        <icon src="http://images.radiotimes.com/assets/images/holding/tv.png?quality=60&amp;mode=crop&amp;width=130&amp;height=100&amp;404=tv" />
     </programme>
</tv>

First, I am trying to remove the icon element whose src equals:
<icon src="http://images.radiotimes.com/assets/images/holding/tv.png?quality=60&amp;mode=crop&amp;width=130&amp;height=100&amp;404=tv" />

Then, for the remaining icons, I am trying to replace quality=60&amp;mode=crop&amp;width=130&amp;height=100

with quality=100&amp;mode=crop&amp;width=1200&amp;height=723

So once the XML file has been parsed and processed, it should look like this:

<tv>
    <programme channel="BBC Red Button 1" start="20180422123000 +0000" stop="20180422125500 +0000">
        <title lang="en">Live Snooker: The World Championship: Day Two - 2018</title>
        <desc lang="en">Coverage of day two at the Crucible Theatre in Sheffield</desc>
        <category lang="en">Sport</category>
        <icon src="http://images.radiotimes.com/remote/static.radiotimes.com.edgesuite.net/pa/70/26/webANXsnookerlivebbc.jpg?quality=100&amp;mode=crop&amp;width=1200&amp;height=723&amp;404=tv" />
    </programme>
    <programme channel="BBC Red Button 1" start="20180422125500 +0000" stop="20180422150000 +0000">
        <title lang="en">Live UEFA Women's Champions League</title>
        <desc lang="en">Manchester City v Lyon (Kick-off 1.00pm)</desc>
        <category lang="en">Sport</category>
     </programme>
</tv>

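I first need to remove the icons I don't want from the XML file, and only then replace the other values, so that I don't end up changing the value of the icon I want to remove. So far I have tried the following to remove the icon, without success: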

#!/bin/sh

from xml.etree.ElementTree import ElementTree

t = ElementTree()
t.parse('/volume1/TVMosaic/Freeview-WG++/guide.xml')
programmeList = t.findall('tv/programme/icon')
for programmeEl in programmeList:
    if programmeEl.attrib['src'] in ('http://images.radiotimes.com/assets/images/holding/tv.png?quality=60&amp;mode=crop&amp;width=130&amp;height=100&amp;404=tv') and \
            programmeEl.attrib['src'] == programmeEl.text:
        del programmeEl.attrib['src']
t.write('/volume1/TVMosaic/Freeview-WG++/PhasedGuide.xml')

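Could someone help me remove the icon that has the src I mentioned, and then replace the values in the remaining icons with the values I mentioned earlier?

Thanks.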

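The problem is that the string you are searching for is XML escaped (note the "&amp;"s), whereas when the file is parsed, the strings are unescaped (&amp; is converted to &, along with a few others). For details, see [Python.Wiki]: Escaping XML.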
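
To make the mismatch concrete, here is a minimal sketch. It uses a shortened, hypothetical example.com URL and an inline string instead of the real guide.xml. It shows that the value returned by the parser contains a plain &, so it cannot equal a literal that still contains &amp;, while escaping the parsed value first makes the comparison succeed.

```python
from xml.etree import ElementTree as ET
from xml.sax.saxutils import escape

# Inline sample (hypothetical URL, not taken from the real guide.xml).
xml_text = '<icon src="http://example.com/tv.png?quality=60&amp;mode=crop" />'
icon = ET.fromstring(xml_text)

# The parser hands back the attribute value with &amp; already turned into &.
parsed_src = icon.get("src")

# The same value written the way it appears in the XML source text.
escaped_literal = "http://example.com/tv.png?quality=60&amp;mode=crop"

print(parsed_src)
print(parsed_src == escaped_literal)          # raw vs. escaped: no match
print(escape(parsed_src) == escaped_literal)  # escaped vs. escaped: match
```

This is also why the comparison in the question can never be true: the constant there is written in its escaped form, while the parsed attribute is not.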

code.py:

#!/usr/bin/env python3

import sys
from xml.etree import ElementTree as ET
from xml.sax.saxutils import escape, unescape


INPUT_FILE_NAME = "guide.xml"
OUTPUT_FILE_NAME = "PhasedGuide.xml"
SRC_ATTR_TEXT = "http://images.radiotimes.com/assets/images/holding/tv.png?quality=60&amp;mode=crop&amp;width=130&amp;height=100&amp;404=tv"
SRC_ATTR_REPLACE_TEXT = "quality=60&amp;mode=crop&amp;width=130&amp;height=100"
SRC_ATTR_REPLACE_WITH_TEXT = "quality=100&amp;mode=crop&amp;width=1200&amp;height=723"


def main():
    tree = ET.parse(INPUT_FILE_NAME)
    tv_node = tree.getroot()
    for programme_node in tv_node.findall("programme"):
        icon_node = programme_node.find("icon")
        if icon_node is not None:
            # The parser returns src unescaped; re-escape it so it matches the constants above
            src_attr = escape(icon_node.get("src", ""))
            if src_attr == SRC_ATTR_TEXT:
                programme_node.remove(icon_node)
            elif src_attr:
                icon_node.set("src", unescape(src_attr.replace(SRC_ATTR_REPLACE_TEXT, SRC_ATTR_REPLACE_WITH_TEXT)))
    tree.write(OUTPUT_FILE_NAME)


if __name__ == "__main__":
    print("Python {:s} on {:s}\n".format(sys.version, sys.platform))
    main()

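Notes:

  • The algorithm loads and parses the file, obtaining the root node (tv)
  • It iterates over all of the root's programme children
  • For each one, it tries to find an icon child node and, if found, gets its src attribute (with the value escaped)
  • Then, based on the (escaped) attribute value, it performs the required action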
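
The same steps can also be sketched without the escape/unescape round trip, by writing the search strings in their unescaped form (plain & instead of &amp;). This is only an alternative sketch, not the code above: the names clean_icons, PLACEHOLDER_SRC, OLD_PARAMS and NEW_PARAMS are made up for illustration, the inline sample with a shortened x.jpg URL stands in for guide.xml, and, unlike the code above, it visits every icon child of a programme rather than only the first.

```python
from xml.etree import ElementTree as ET

# Unescaped forms of the strings from the question (plain & instead of &amp;).
PLACEHOLDER_SRC = ("http://images.radiotimes.com/assets/images/holding/tv.png"
                   "?quality=60&mode=crop&width=130&height=100&404=tv")
OLD_PARAMS = "quality=60&mode=crop&width=130&height=100"
NEW_PARAMS = "quality=100&mode=crop&width=1200&height=723"


def clean_icons(tv_node):
    """Remove placeholder icons, then rewrite the size parameters of the rest."""
    for programme in tv_node.findall("programme"):
        # findall() returns a list, so removing children inside the loop is safe.
        for icon in programme.findall("icon"):
            src = icon.get("src", "")
            if src == PLACEHOLDER_SRC:
                programme.remove(icon)
            else:
                icon.set("src", src.replace(OLD_PARAMS, NEW_PARAMS))


# Small inline sample standing in for guide.xml (the x.jpg URL is hypothetical).
SAMPLE = """<tv>
  <programme channel="A"><icon src="http://images.radiotimes.com/remote/x.jpg?quality=60&amp;mode=crop&amp;width=130&amp;height=100&amp;404=tv" /></programme>
  <programme channel="B"><icon src="http://images.radiotimes.com/assets/images/holding/tv.png?quality=60&amp;mode=crop&amp;width=130&amp;height=100&amp;404=tv" /></programme>
</tv>"""

root = ET.fromstring(SAMPLE)
clean_icons(root)
remaining = [icon.get("src") for icon in root.iter("icon")]
print(remaining)
```

Because the placeholder check happens before any replacement, the icon to be removed is never modified first, which matches the ordering requirement in the question.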

Output:

(py35x64_test) e:\Work\Dev\StackOverflow\q049967927>"e:\Work\Dev\VEnvs\py35x64_test\Scripts\python.exe" code.py
Python 3.5.4 (v3.5.4:3f56838, Aug  8 2017, 02:17:05) [MSC v.1900 64 bit (AMD64)] on win32


(py35x64_test) e:\Work\Dev\StackOverflow\q049967927>type PhasedGuide.xml
<tv>
    <programme channel="BBC Red Button 1" start="20180422123000 +0000" stop="20180422125500 +0000">
        <title lang="en">Live Snooker: The World Championship: Day Two - 2018</title>
        <desc lang="en">Coverage of day two at the Crucible Theatre in Sheffield</desc>
        <category lang="en">Sport</category>
        <icon src="http://images.radiotimes.com/remote/static.radiotimes.com.edgesuite.net/pa/70/26/webANXsnookerlivebbc.jpg?quality=100&amp;mode=crop&amp;width=1200&amp;height=723&amp;404=tv" />
    </programme>
    <programme channel="BBC Red Button 1" start="20180422125500 +0000" stop="20180422150000 +0000">
        <title lang="en">Live UEFA Women's Champions League</title>
        <desc lang="en">Manchester City v Lyon (Kick-off 1.00pm)</desc>
        <category lang="en">Sport</category>
        </programme>
</tv>
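
The &amp; sequences in the written file come from the serializer, which escapes attribute values on output. That appears to be why code.py unescapes the value before calling set(). A minimal sketch of the pitfall, using a shortened hypothetical src value: setting an already-escaped string gets it escaped a second time.

```python
from xml.etree import ElementTree as ET

icon = ET.Element("icon")

# Unescaped value: the serializer escapes the & exactly once.
icon.set("src", "tv.png?quality=100&mode=crop")
once = ET.tostring(icon, encoding="unicode")

# Already-escaped value: the & inside &amp; is escaped again.
icon.set("src", "tv.png?quality=100&amp;mode=crop")
twice = ET.tostring(icon, encoding="unicode")

print(once)
print(twice)
```

The second line printed contains &amp;amp;, which would produce a broken URL once the file is read back.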