How to parse HTML inside CDATA using Python?

Question

I get an XML object from a website that looks like this:

<?xml version='1.0' encoding='UTF-8'?>
<partial-response id="j_id1">
    <changes>
        <update id="loginForm:tabelaProcessos">
            <![CDATA[<tr data-ri="5" class="ui-widget-content ui-datatable-odd" role="row"><td role="gridcell" style="word-break:break-all;"><span style="font-size:7pt;text-align: center;" title="XPT">08454.8100</span></td><td role="gridcell"><span style="font-size:7pt;" title="tDFvo">ARÁ</span></td><td role="gridcell"><span style="font-size:7pt;" title="PDSDo">TA15A</span></td><td role="gridcell"><span style="font-size:7pt;" title="P125ão">MINIRAL</span></td><td role="gridcell"><span style="font-size:7pt;" title="A12o">-</span></td><td role="gridcell"><span style="font-size:7pt;" title="O4545ão">-</span></td><td role="gridcell"><span style="font-size:7pt;" title="A45So">- </span></td><td role="gridcell"><span style="font-size:7pt;" title="ASD1vo">-</span></td><td role="gridcell"><span style="font-size:7pt;" title="D45el">18/02/2021 04:35:30</span></td><td role="gridcell"><span style="font-size:7pt;" title="Idto">405833357</span></td></tr>]]>
        </update>
        <update id="j_id1:javax.faces.ViewState:0">
            <![CDATA[-8530455S7417:3382887371AS10732]]>
        </update>
        <extension ln="primefaces" type="args">{"totalRecords":1}</extension>
    </changes>
</partial-response>

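I need to parse the table row inside the CDATA. I tried passing it to lxml.html.fromstring(), but the output ignores the CDATA content. Is there a way to get everything inside the CDATA using lxml or another Python library?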

Answer 1

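Use BeautifulSoup. CData is a subclass of NavigableString, so CDATA sections can be picked out from the other text nodes with an isinstance check.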

import bs4

data = """<?xml version='1.0' encoding='UTF-8'?>
<partial-response id="j_id1">
    <changes>
        <update id="loginForm:tabelaProcessos">
            <![CDATA[<tr data-ri="5" class="ui-widget-content ui-datatable-odd" role="row"><td role="gridcell" style="word-break:break-all;"><span style="font-size:7pt;text-align: center;" title="XPT">08454.8100</span></td><td role="gridcell"><span style="font-size:7pt;" title="tDFvo">ARÁ</span></td><td role="gridcell"><span style="font-size:7pt;" title="PDSDo">TA15A</span></td><td role="gridcell"><span style="font-size:7pt;" title="P125ão">MINIRAL</span></td><td role="gridcell"><span style="font-size:7pt;" title="A12o">-</span></td><td role="gridcell"><span style="font-size:7pt;" title="O4545ão">-</span></td><td role="gridcell"><span style="font-size:7pt;" title="A45So">- </span></td><td role="gridcell"><span style="font-size:7pt;" title="ASD1vo">-</span></td><td role="gridcell"><span style="font-size:7pt;" title="D45el">18/02/2021 04:35:30</span></td><td role="gridcell"><span style="font-size:7pt;" title="Idto">405833357</span></td></tr>]]>
        </update>
        <update id="j_id1:javax.faces.ViewState:0">
            <![CDATA[-8530455S7417:3382887371AS10732]]>
        </update>
        <extension ln="primefaces" type="args">{"totalRecords":1}</extension>
    </changes>
</partial-response>"""

soup = bs4.BeautifulSoup(data, 'html.parser')

# Walk every text node; CDATA sections come back as bs4.CData instances.
for cd in soup.find_all(string=True):
    if isinstance(cd, bs4.CData):
        print('CData contents: %r' % cd)

Reference: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#comments-and-other-special-strings
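
Once the CDATA string has been extracted, it still has to be parsed as HTML to get at the individual cells. The following is a minimal sketch of that second step using only the standard library rather than BeautifulSoup or lxml. It assumes the response has the same shape as in the question; the XML here is a shortened stand-in, and CellCollector and rows_from_update are hypothetical names introduced for illustration. An XML parser exposes the CDATA content as the element's ordinary text, so the text of the matching update element can be fed straight into an HTML parser.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Shortened stand-in for the response in the question (assumption: same shape).
XML = """<partial-response id="j_id1">
    <changes>
        <update id="loginForm:tabelaProcessos">
            <![CDATA[<tr role="row"><td role="gridcell"><span title="XPT">08454.8100</span></td><td role="gridcell"><span title="Idto">405833357</span></td></tr>]]>
        </update>
    </changes>
</partial-response>"""


class CellCollector(HTMLParser):
    """Hypothetical helper: gathers the text of each <td>, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._cell = None  # text pieces of the <td> being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag == "td":
            self._cell = []

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            if not self.rows:  # tolerate a <td> that appears without a <tr>
                self.rows.append([])
            self.rows[-1].append("".join(self._cell).strip())
            self._cell = None

    def handle_data(self, data):
        # Only keep text that sits inside a table cell.
        if self._cell is not None:
            self._cell.append(data)


def rows_from_update(xml_text, update_id):
    """Return the table rows found in the CDATA of the given <update> element."""
    root = ET.fromstring(xml_text)
    for update in root.iter("update"):
        if update.get("id") == update_id:
            collector = CellCollector()
            collector.feed(update.text or "")  # CDATA content is plain text here
            collector.close()
            return collector.rows
    return []


rows = rows_from_update(XML, "loginForm:tabelaProcessos")
print(rows)
```

Selecting the update element by its id keeps the ViewState CDATA, which is not HTML, out of the HTML parser. The same two-step idea should carry over to lxml, where the CDATA content is likewise available through the element's .text attribute.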

xml

lxml

python-3.x