How can I reference a parent and remove the parent element in an RSS XML through LXML in Python?

Question

I haven't been able to crack this one. I have an RSS feed in the form of an XML file. Simplified, it looks like this:

<rss version="2.0">
    <channel>
        <title>My RSS Feed</title>
        <link href="https://www.examplefeedurl.com">Feed</link>
        <description></description>
        <item>...</item>
        <item>...</item>
        <item>...</item>
        <item>
            <guid></guid>
            <pubDate></pubDate>
            <author/>
            <title>Title of the item</title>
            <link href="https://example.com" rel="alternate" type="text/html"/>
            <description>
            <![CDATA[<a href="https://example.com" target="_blank" rel="noopener noreferrer">View Example</a>]]>
            </description>
            <description>
            <![CDATA[<p>This actually contains a bunch of text I want to work with. If this text contains certain strings, I want to get rid of the whole item.</p>]]>
            </description>
        </item>
        <item>...</item>
    </channel>
</rss>

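My objective is to check whether the second description tag contains certain strings. If it does, I want to remove the enclosing item completely. Currently my code looks like this: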

import lxml.etree

doc = lxml.etree.fromstring(testString)
found = doc.findall('channel/item/description')


for desc in found:
    if "FORBIDDENSTRING" in desc.text:
        desc.getparent().remove(desc)

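It only removes the second description tag, which makes sense, but I want the whole item gone. I don't know how to get hold of the 'item' element when all I have is the 'desc' reference.

I've tried Googling and searching here, but the cases I find only want to remove the tag itself, as I'm doing now. Oddly, I haven't come across sample code that removes the entire parent element. Any pointers to documentation, tutorials, or other help are very welcome.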

Answer 1

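Consider XSLT, a special-purpose language designed to transform XML files, for example by conditionally removing nodes based on their values. Python's lxml can run XSLT 1.0 scripts and can even pass parameters from the Python script into XSLT (much like passing parameters to a SQL query). This way you avoid any for loops, if logic, or rebuilding the tree at the application layer.

XSLT (save as a .xsl file, which is a special kind of .xml file)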

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output indent="yes" cdata-section-elements="description"/>
  <xsl:strip-space elements="*"/>

  <!-- VALUE TO BE PASSED INTO FROM PYTHON -->
  <xsl:param name="search_string" />       

  <!-- IDENTITY TRANSFORM -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- KEEP ONLY item NODES THAT DO NOT CONTAIN $search_string -->
  <xsl:template match="channel">
    <xsl:copy>
      <xsl:apply-templates select="item[not(contains(description[2], $search_string))]"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>

Python (for demonstration, the script below runs two searches against the posted sample)

import lxml.etree as et

# LOAD XML AND XSL
doc = et.parse('Input.xml')
xsl = et.parse('XSLT_String.xsl')

# CONFIGURE TRANSFORMER
transform = et.XSLT(xsl)    

# RUN TRANSFORMATION WITH PARAM (STRING NOT IN THE SAMPLE, SO THE ITEM IS KEPT)
n = et.XSLT.strparam('FORBIDDENSTRING')
result = transform(doc, search_string=n)

print(result)
# <?xml version="1.0"?>
# <rss version="2.0">
#   <channel>
#     <item>...</item>
#     <item>...</item>
#     <item>...</item>
#     <item>
#       <guid/>
#       <pubDate/>
#       <author/>
#       <title>Title of the item</title>
#       <link href="https://example.com" rel="alternate" type="text/html"/>
#       <description><![CDATA[<a href="https://example.com" target="_blank" rel="noopener noreferrer">View Example</a>]]></description>
#       <description><![CDATA[<p>This actually contains a bunch of text I want to work with. If this text contains certain strings, I want to get rid of the whole item.</p>]]></description>
#     </item>
#     <item>...</item>
#   </channel>
# </rss>

# RUN TRANSFORMATION WITH PARAM (STRING IS IN THE SAMPLE, SO THE ITEM IS DROPPED)
n = et.XSLT.strparam('bunch of text')
result = transform(doc, search_string=n)

print(result)    
# <?xml version="1.0"?>
# <rss version="2.0">
#   <channel>
#     <item>...</item>
#     <item>...</item>
#     <item>...</item>
#     <item>...</item>
#   </channel>
# </rss>

# SAVE TO FILE
with open('Output.xml', 'wb') as f:
    f.write(result)
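
One thing to be aware of: the channel template above applies templates only to item children, so the channel's own title, link, and description are not copied, as the printed output shows. If those should survive, a minimal sketch of a variant follows. It assumes the same feed layout; the inline sample feed and the helper name drop_items are made up for illustration, and the stylesheet is embedded as a string only so the snippet is self-contained.

```python
import lxml.etree as et

# Illustrative sample feed (an assumption, not the asker's real data)
FEED = """<rss version="2.0">
  <channel>
    <title>My RSS Feed</title>
    <item>
      <title>Keep me</title>
      <description>link markup</description>
      <description>harmless text</description>
    </item>
    <item>
      <title>Drop me</title>
      <description>link markup</description>
      <description>contains FORBIDDENSTRING here</description>
    </item>
  </channel>
</rss>"""

# Identity transform plus a channel template that copies everything
# except item elements whose second description contains $search_string
XSL = """<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:strip-space elements="*"/>
  <xsl:param name="search_string"/>

  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="channel">
    <xsl:copy>
      <xsl:apply-templates
          select="@*|node()[not(self::item[contains(description[2], $search_string)])]"/>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>"""


def drop_items(feed_xml, search_string):
    """Return the root of a transformed copy of the feed (hypothetical helper)."""
    doc = et.fromstring(feed_xml)
    transform = et.XSLT(et.fromstring(XSL))
    # strparam() takes care of quoting the Python string for XSLT
    result = transform(doc, search_string=et.XSLT.strparam(search_string))
    return result.getroot()


root = drop_items(FEED, "FORBIDDENSTRING")
print(et.tostring(root, pretty_print=True).decode())
```

The only change to the stylesheet logic is the select expression: it copies every attribute and child of channel except item elements whose second description contains the search string.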

Answer 2

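I'm a big fan of XSLT, but another option is to select the item instead of the description (select the element you want to remove, not its child).

Also, if you use xpath(), you can put the check for the forbidden string directly in the XPath predicate.

Example...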
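
A caveat when the search string is spliced into the expression as text: a value containing a double quote would break the predicate. As a hedged sketch, lxml's xpath() also accepts XPath variables as keyword arguments, which sidesteps the quoting. The tiny feed and the helper name remove_items_containing below are illustrative assumptions.

```python
from lxml import etree

# Cut-down feed for illustration only (an assumption, not the real data)
FEED = """<rss version="2.0"><channel>
<item><title>keep</title><description>plain text</description></item>
<item><title>drop</title><description>says "forbidden" here</description></item>
</channel></rss>"""


def remove_items_containing(root, needle):
    """Remove every item with a description containing needle; return the count."""
    # $needle is bound from the keyword argument, so quotes inside it are harmless
    matches = root.xpath(
        "channel/item[description[contains(., $needle)]]", needle=needle
    )
    for item in matches:
        item.getparent().remove(item)
    return len(matches)


root = etree.fromstring(FEED)
removed = remove_items_containing(root, 'says "forbidden"')
print(removed, [t.text for t in root.iter("title")])
```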

from lxml import etree

testString = """
<rss version="2.0">
    <channel>
        <title>My RSS Feed</title>
        <link href="https://www.examplefeedurl.com">Feed</link>
        <description></description>
        <item>...</item>
        <item>...</item>
        <item>...</item>
        <item>
            <guid></guid>
            <pubDate></pubDate>
            <author/>
            <title>Title of the item</title>
            <link href="https://example.com" rel="alternate" type="text/html"/>
            <description>
            <![CDATA[<a href="https://example.com" target="_blank" rel="noopener noreferrer">View Example</a>]]>
            </description>
            <description>
            <![CDATA[<p>This actually contains a bunch of text I want to work with. If this text contains certain strings, I want to get rid of the whole item.</p>]]>
            </description>
        </item>
        <item>...</item>
    </channel>
</rss>
"""

forbidden_string = "I want to get rid of the whole item"

parser = etree.XMLParser(strip_cdata=False)
doc = etree.fromstring(testString, parser=parser)
found = doc.xpath('.//channel/item[description[contains(.,"{}")]]'.format(forbidden_string))

for item in found:
    item.getparent().remove(item)

print(etree.tostring(doc, encoding="unicode", pretty_print=True))

This prints...

<rss version="2.0">
    <channel>
        <title>My RSS Feed</title>
        <link href="https://www.examplefeedurl.com">Feed</link>
        <description/>
        <item>...</item>
        <item>...</item>
        <item>...</item>
        <item>...</item>
    </channel>
</rss>
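
To stay closer to the loop in the question, the item can also be reached from the description itself: desc.getparent() is the item, and the item's own getparent() is the channel that has to do the removing. A minimal sketch, assuming a cut-down feed and a hypothetical helper name; the guards for empty text and for an item that was already removed are additions, not part of the original code.

```python
from lxml import etree

# Cut-down feed for illustration only (an assumption, not the real data)
FEED = """<rss version="2.0"><channel>
<title>My RSS Feed</title>
<item><title>keep</title><description>link</description><description>fine</description></item>
<item><title>drop</title><description>link</description><description>has FORBIDDENSTRING</description></item>
</channel></rss>"""


def remove_items_by_description(root, needle):
    """Walk up from each matching description and remove its item."""
    removed = 0
    # findall() returns a plain list, so removing elements while looping is safe
    for desc in root.findall("channel/item/description"):
        item = desc.getparent()        # <description> -> enclosing <item>
        channel = item.getparent()     # <item> -> enclosing <channel>
        if channel is None:
            continue                   # item was already removed via a sibling description
        if needle in (desc.text or ""):  # .text is None for empty elements
            channel.remove(item)
            removed += 1
    return removed


root = etree.fromstring(FEED)
removed = remove_items_by_description(root, "FORBIDDENSTRING")
print(removed, [i.findtext("title") for i in root.findall("channel/item")])
```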
