How can I reference a parent and remove the parent element in an RSS XML through LXML in Python?
I haven't been able to crack this one. I have an RSS feed in the form of an XML file. Simplified, it looks like this:
<rss version="2.0">
  <channel>
    <title>My RSS Feed</title>
    <link href="https://www.examplefeedurl.com">Feed</link>
    <description></description>
    <item>...</item>
    <item>...</item>
    <item>...</item>
    <item>
      <guid></guid>
      <pubDate></pubDate>
      <author/>
      <title>Title of the item</title>
      <link href="https://example.com" rel="alternate" type="text/html"/>
      <description>
        <![CDATA[<a href="https://example.com" target="_blank" rel="noopener noreferrer">View Example</a>]]>
      </description>
      <description>
        <![CDATA[<p>This actually contains a bunch of text I want to work with. If this text contains certain strings, I want to get rid of the whole item.</p>]]>
      </description>
    </item>
    <item>...</item>
  </channel>
</rss>
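My objective is to check whether the second description tag contains certain strings. If it does, I'd like to remove the item completely. Currently I have this in my code: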
doc = lxml.etree.fromstring(testString)
found = doc.findall('channel/item/description')
for desc in found:
    if "FORBIDDENSTRING" in desc.text:
        desc.getparent().remove(desc)
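It only removes the second description tag, which makes sense, but I want the whole item gone.
I don't know how to get hold of the 'item' element if all I have is the 'desc' reference.
I've tried googling and searching, but the cases I find only want to remove the tag itself, as I'm doing now. Oddly, I haven't stumbled upon sample code that removes the entire parent object.
Any pointers to documentation/tutorials, or any help, are very welcome.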
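Consider XSLT, the special-purpose language designed to transform XML files, for example by conditionally removing nodes by value. Python's lxml can run XSLT 1.0 scripts and can even pass parameters from the Python script to the XSLT (much like passing parameters to a SQL query!). This way you avoid any for loops, if logic, or rebuilding of the tree at the application layer.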
XSLT (save as an .xsl file, which is a special kind of .xml file)
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output indent="yes" cdata-section-elements="description"/>
  <xsl:strip-space elements="*"/>
  <!-- VALUE TO BE PASSED INTO FROM PYTHON -->
  <xsl:param name="search_string" />
  <!-- IDENTITY TRANSFORM -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
  <!-- KEEP ONLY item NODES WHOSE SECOND description DOES NOT CONTAIN $search_string -->
  <!-- (NOTE: OTHER channel CHILDREN SUCH AS title AND link ARE NOT COPIED) -->
  <xsl:template match="channel">
    <xsl:copy>
      <xsl:apply-templates select="item[not(contains(description[2], $search_string))]"/>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>
Python (for demonstration, the code below runs two searches against the posted sample)
import lxml.etree as et
# LOAD XML AND XSL
doc = et.parse('Input.xml')
xsl = et.parse('XSLT_String.xsl')
# CONFIGURE TRANSFORMER
transform = et.XSLT(xsl)
# RUN TRANSFORMATION WITH PARAM
n = et.XSLT.strparam('FORBIDDENSTRING')
result = transform(doc, search_string=n)
print(result)
# <?xml version="1.0"?>
# <rss version="2.0">
# <channel>
# <item>...</item>
# <item>...</item>
# <item>...</item>
# <item>
# <guid/>
# <pubDate/>
# <author/>
# <title>Title of the item</title>
# <link href="https://example.com" rel="alternate" type="text/html"/>
# <description><![CDATA[<a href="https://example.com" target="_blank" rel="noopener noreferrer">View Example</a>]]></description>
# <description><![CDATA[<p>This actually contains a bunch of text I want to work with. If this text contains certain strings, I want to get rid of the whole item.</p>]]></description>
# </item>
# <item>...</item>
# </channel>
# </rss>
# RUN TRANSFORMATION WITH PARAM
n = et.XSLT.strparam('bunch of text')
result = transform(doc, search_string=n)
print(result)
# <?xml version="1.0"?>
# <rss version="2.0">
# <channel>
# <item>...</item>
# <item>...</item>
# <item>...</item>
# <item>...</item>
# </channel>
# </rss>
# SAVE TO FILE
with open('Output.xml', 'wb') as f:
    f.write(result)
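One thing to note about the stylesheet above: because the channel template applies templates only to item children, the channel's own title, link and description do not appear in the result (the printed outputs show this). If they should be kept, one possible variant is to filter inside the node() selection instead. The following is only a minimal sketch under that assumption; the inline XSLT string, the SAMPLE feed and the filter_items helper are made up for illustration and are not part of the original answer.

```python
import lxml.etree as et

# Variant stylesheet (hypothetical): same identity transform, but the channel
# template copies all children except the items that match the search string.
XSLT = b"""<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output indent="yes" cdata-section-elements="description"/>
  <xsl:strip-space elements="*"/>
  <xsl:param name="search_string"/>
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
  <xsl:template match="channel">
    <xsl:copy>
      <xsl:apply-templates
          select="@*|node()[not(self::item and contains(description[2], $search_string))]"/>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>"""

# Tiny stand-in feed (assumption), with two descriptions per item as in the question.
SAMPLE = b"""<rss version="2.0"><channel>
<title>My RSS Feed</title>
<item><description>first</description><description>keep me</description></item>
<item><description>first</description><description>has FORBIDDENSTRING inside</description></item>
</channel></rss>"""


def filter_items(xml_bytes, search_string):
    """Run the variant stylesheet and return the lxml result tree."""
    transform = et.XSLT(et.fromstring(XSLT))
    doc = et.fromstring(xml_bytes)
    return transform(doc, search_string=et.XSLT.strparam(search_string))


result = filter_items(SAMPLE, "FORBIDDENSTRING")
root = result.getroot()
print(len(root.findall("channel/item")))  # number of items left
print(root.findtext("channel/title"))     # channel title is still present
```

The predicate is false for anything that is not an item, so text, title, link and the channel-level description pass through the identity template unchanged.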
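I'm a big fan of XSLT, but another option is to select the item instead of the description (select the element you want to delete, not its child).
Also, if you use xpath(), you can put the check for the forbidden string directly in the XPath predicate.
Example...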
from lxml import etree

testString = """
<rss version="2.0">
  <channel>
    <title>My RSS Feed</title>
    <link href="https://www.examplefeedurl.com">Feed</link>
    <description></description>
    <item>...</item>
    <item>...</item>
    <item>...</item>
    <item>
      <guid></guid>
      <pubDate></pubDate>
      <author/>
      <title>Title of the item</title>
      <link href="https://example.com" rel="alternate" type="text/html"/>
      <description>
        <![CDATA[<a href="https://example.com" target="_blank" rel="noopener noreferrer">View Example</a>]]>
      </description>
      <description>
        <![CDATA[<p>This actually contains a bunch of text I want to work with. If this text contains certain strings, I want to get rid of the whole item.</p>]]>
      </description>
    </item>
    <item>...</item>
  </channel>
</rss>
"""

forbidden_string = "I want to get rid of the whole item"

# Keep CDATA sections as CDATA instead of converting them to plain text
parser = etree.XMLParser(strip_cdata=False)
doc = etree.fromstring(testString, parser=parser)

# Select the item elements themselves; the predicate checks their descriptions
found = doc.xpath('.//channel/item[description[contains(.,"{}")]]'.format(forbidden_string))
for item in found:
    # Remove each matching item from its parent (the channel)
    item.getparent().remove(item)

print(etree.tostring(doc, encoding="unicode", pretty_print=True))
This prints...
<rss version="2.0">
  <channel>
    <title>My RSS Feed</title>
    <link href="https://www.examplefeedurl.com">Feed</link>
    <description/>
    <item>...</item>
    <item>...</item>
    <item>...</item>
    <item>...</item>
  </channel>
</rss>
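To answer the literal question as well: given only the 'desc' reference, desc.getparent() is the enclosing item, and that item's own getparent() is the channel it has to be removed from. Below is a minimal sketch that keeps the loop from the question. The remove_items_containing helper name and the tiny sample feed are made up for illustration. It also guards against desc.text being None (an empty description) and collects the items first, so the same item is never removed twice.

```python
from lxml import etree


def remove_items_containing(root, forbidden):
    """Remove every channel/item that has a description containing `forbidden`.

    Returns the number of items removed. (Hypothetical helper for illustration.)
    """
    doomed = []
    for desc in root.findall("channel/item/description"):
        # desc.text is None for an empty <description/>, so guard before `in`
        if desc.text and forbidden in desc.text:
            item = desc.getparent()        # the enclosing <item>
            if item not in doomed:
                doomed.append(item)
    for item in doomed:
        item.getparent().remove(item)      # remove <item> from its <channel>
    return len(doomed)


# Small stand-in feed (assumption), two descriptions per item as in the question
feed = etree.fromstring(
    "<rss version='2.0'><channel>"
    "<item><description>link</description><description>fine</description></item>"
    "<item><description>link</description><description>has FORBIDDENSTRING</description></item>"
    "</channel></rss>"
)
removed = remove_items_containing(feed, "FORBIDDENSTRING")
print(removed, len(feed.findall("channel/item")))
```

Collecting the matches before removing anything keeps the loop from modifying the tree while it is still being walked.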