Python: parse XML, remove elements by certain attributes and replace text within attributes
I have the following XML file:
<tv>
  <programme channel="BBC Red Button 1" start="20180422123000 +0000" stop="20180422125500 +0000">
    <title lang="en">Live Snooker: The World Championship: Day Two - 2018</title>
    <desc lang="en">Coverage of day two at the Crucible Theatre in Sheffield</desc>
    <category lang="en">Sport</category>
    <icon src="http://images.radiotimes.com/remote/static.radiotimes.com.edgesuite.net/pa/70/26/webANXsnookerlivebbc.jpg?quality=60&amp;mode=crop&amp;width=130&amp;height=100&amp;404=tv" />
  </programme>
  <programme channel="BBC Red Button 1" start="20180422125500 +0000" stop="20180422150000 +0000">
    <title lang="en">Live UEFA Women's Champions League</title>
    <desc lang="en">Manchester City v Lyon (Kick-off 1.00pm)</desc>
    <category lang="en">Sport</category>
    <icon src="http://images.radiotimes.com/assets/images/holding/tv.png?quality=60&amp;mode=crop&amp;width=130&amp;height=100&amp;404=tv" />
  </programme>
</tv>
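First, I'm trying to remove the icon element whose src is equal to:
<icon src="http://images.radiotimes.com/assets/images/holding/tv.png?quality=60&amp;mode=crop&amp;width=130&amp;height=100&amp;404=tv" />
Then, for the remaining icons, I'm trying to replace quality=60&amp;mode=crop&amp;width=130&amp;height=100
with quality=100&amp;mode=crop&amp;width=1200&amp;height=723
So once the XML file has been processed, it should look like this: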
<tv>
  <programme channel="BBC Red Button 1" start="20180422123000 +0000" stop="20180422125500 +0000">
    <title lang="en">Live Snooker: The World Championship: Day Two - 2018</title>
    <desc lang="en">Coverage of day two at the Crucible Theatre in Sheffield</desc>
    <category lang="en">Sport</category>
    <icon src="http://images.radiotimes.com/remote/static.radiotimes.com.edgesuite.net/pa/70/26/webANXsnookerlivebbc.jpg?quality=100&amp;mode=crop&amp;width=1200&amp;height=723&amp;404=tv" />
  </programme>
  <programme channel="BBC Red Button 1" start="20180422125500 +0000" stop="20180422150000 +0000">
    <title lang="en">Live UEFA Women's Champions League</title>
    <desc lang="en">Manchester City v Lyon (Kick-off 1.00pm)</desc>
    <category lang="en">Sport</category>
  </programme>
</tv>
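I need to remove the unwanted icon from the XML file first and only then replace the values in the others, so that I don't end up changing the value of the icon I want to delete. So far I've tried the following to remove the icon, but without success: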
#!/bin/sh
from xml.etree.ElementTree import ElementTree
t = ElementTree()
t.parse('/volume1/TVMosaic/Freeview-WG++/guide.xml')
programmeList = t.findall('tv/programme/icon')
for programmeEl in programmeList:
    if programmeEl.attrib['src'] in ('http://images.radiotimes.com/assets/images/holding/tv.png?quality=60&amp;mode=crop&amp;width=130&amp;height=100&amp;404=tv') and \
            programmeEl.attrib['src'] == programmeEl.text:
        del programmeEl.attrib['src']
t.write('/volume1/TVMosaic/Freeview-WG++/PhasedGuide.xml')
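Could someone help me remove the icon with the src I mentioned, and then replace the values in the remaining icons with the ones I mentioned earlier?
Thanks.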
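The problem is that the string you're searching for is XML escaped (note the "&amp;"s), whereas when the file is parsed the strings are unescaped (&amp; is converted to &, along with a few others). For details, see [Python.Wiki]: Escaping XML.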
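A minimal sketch of that behaviour, assuming a made-up one-element document instead of the real guide.xml (the example.com URL and the parsed_src helper are hypothetical): ElementTree returns the attribute value with &amp; already decoded to &, and escapes it again when serializing.

```python
from xml.etree import ElementTree as ET
from xml.sax.saxutils import escape

# Hypothetical one-element document; the URL is shortened for illustration.
RAW_XML = '<icon src="http://example.com/tv.png?quality=60&amp;mode=crop" />'


def parsed_src(xml_text):
    """Return the src attribute as ElementTree exposes it after parsing."""
    return ET.fromstring(xml_text).get("src")


src = parsed_src(RAW_XML)
print(src)          # the parsed value contains a plain "&"
print(escape(src))  # escaping restores the "&amp;" form found in the file
# Serializing the element escapes the attribute value again.
print(ET.tostring(ET.fromstring(RAW_XML), encoding="unicode"))
```

So a comparison against a string containing "&amp;" can never match the parsed value unless one side is converted first.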
code.py:
#!/usr/bin/env python3

import sys
from xml.etree import ElementTree as ET
from xml.sax.saxutils import escape, unescape

INPUT_FILE_NAME = "guide.xml"
OUTPUT_FILE_NAME = "PhasedGuide.xml"

# The search strings are kept in their XML-escaped form ("&amp;"), as they appear in the file.
SRC_ATTR_TEXT = "http://images.radiotimes.com/assets/images/holding/tv.png?quality=60&amp;mode=crop&amp;width=130&amp;height=100&amp;404=tv"
SRC_ATTR_REPLACE_TEXT = "quality=60&amp;mode=crop&amp;width=130&amp;height=100"
SRC_ATTR_REPLACE_WITH_TEXT = "quality=100&amp;mode=crop&amp;width=1200&amp;height=723"


def main():
    tree = ET.parse(INPUT_FILE_NAME)
    tv_node = tree.getroot()
    for programme_node in tv_node.findall("programme"):
        icon_node = programme_node.find("icon")
        if icon_node is not None:
            print(icon_node.get("src", ""))
            # The parser returns the unescaped value, so escape it again before comparing.
            src_attr = escape(icon_node.get("src", ""))
            if src_attr == SRC_ATTR_TEXT:
                programme_node.remove(icon_node)
            elif src_attr:
                # Replace within the escaped text, then unescape; tree.write() escapes again on output.
                icon_node.set("src", unescape(src_attr.replace(SRC_ATTR_REPLACE_TEXT, SRC_ATTR_REPLACE_WITH_TEXT)))
    tree.write(OUTPUT_FILE_NAME)


if __name__ == "__main__":
    print("Python {:s} on {:s}\n".format(sys.version, sys.platform))
    main()
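Notes:
- The algorithm loads and parses the file, obtaining the root node (tv)
- It iterates over all of its programme children
- For each one, it tries to find an icon child node and, if found, gets its src attribute (the value is then escaped)
- Then, based on the (escaped) attribute value, it performs the required action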
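Because the parser already returns unescaped values, the same steps can also be sketched without escape/unescape, by writing the search strings with a plain &. This is a variation on the code above, not the answer's own method. The names clean_icons, PLACEHOLDER_SRC, OLD_PARAMS and NEW_PARAMS and the example.com URL are made up for illustration, and the loop assumes a programme may carry more than one icon child.

```python
from xml.etree import ElementTree as ET

# Values as they look after parsing, i.e. with "&" instead of "&amp;".
PLACEHOLDER_SRC = ("http://images.radiotimes.com/assets/images/holding/tv.png"
                   "?quality=60&mode=crop&width=130&height=100&404=tv")
OLD_PARAMS = "quality=60&mode=crop&width=130&height=100"
NEW_PARAMS = "quality=100&mode=crop&width=1200&height=723"


def clean_icons(tv_node):
    """Drop placeholder icons, then rewrite the size parameters of the rest."""
    for programme in tv_node.findall("programme"):
        # findall() returns a list, so removing children inside the loop is safe.
        for icon in programme.findall("icon"):
            src = icon.get("src", "")
            if src == PLACEHOLDER_SRC:
                programme.remove(icon)
            elif src:
                icon.set("src", src.replace(OLD_PARAMS, NEW_PARAMS))
    return tv_node


# Hypothetical two-programme sample; in the file itself "&" is written as "&amp;".
SAMPLE = (
    '<tv>'
    '<programme channel="A"><icon src="http://example.com/a.jpg'
    '?quality=60&amp;mode=crop&amp;width=130&amp;height=100&amp;404=tv" /></programme>'
    '<programme channel="B"><icon src="http://images.radiotimes.com/assets/images/holding/tv.png'
    '?quality=60&amp;mode=crop&amp;width=130&amp;height=100&amp;404=tv" /></programme>'
    '</tv>'
)

root = clean_icons(ET.fromstring(SAMPLE))
print(ET.tostring(root, encoding="unicode"))
```

For the real file, the root would come from ET.parse("guide.xml").getroot() and the tree would be written back with tree.write(), as in code.py.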
Output:
(py35x64_test) e:\Work\Dev\StackOverflow\q049967927>"e:\Work\Dev\VEnvs\py35x64_test\Scripts\python.exe" code.py
Python 3.5.4 (v3.5.4:3f56838, Aug 8 2017, 02:17:05) [MSC v.1900 64 bit (AMD64)] on win32
(py35x64_test) e:\Work\Dev\StackOverflow\q049967927>type PhasedGuide.xml
<tv>
  <programme channel="BBC Red Button 1" start="20180422123000 +0000" stop="20180422125500 +0000">
    <title lang="en">Live Snooker: The World Championship: Day Two - 2018</title>
    <desc lang="en">Coverage of day two at the Crucible Theatre in Sheffield</desc>
    <category lang="en">Sport</category>
    <icon src="http://images.radiotimes.com/remote/static.radiotimes.com.edgesuite.net/pa/70/26/webANXsnookerlivebbc.jpg?quality=100&amp;mode=crop&amp;width=1200&amp;height=723&amp;404=tv" />
  </programme>
  <programme channel="BBC Red Button 1" start="20180422125500 +0000" stop="20180422150000 +0000">
    <title lang="en">Live UEFA Women's Champions League</title>
    <desc lang="en">Manchester City v Lyon (Kick-off 1.00pm)</desc>
    <category lang="en">Sport</category>
  </programme>
</tv>