How to parse HTML inside CDATA using Python?
I get an XML object from a website that looks like this:
<?xml version='1.0' encoding='UTF-8'?>
<partial-response id="j_id1">
<changes>
<update id="loginForm:tabelaProcessos">
<![CDATA[<tr data-ri="5" class="ui-widget-content ui-datatable-odd" role="row"><td role="gridcell" style="word-break:break-all;"><span style="font-size:7pt;text-align: center;" title="XPT">08454.8100</span></td><td role="gridcell"><span style="font-size:7pt;" title="tDFvo">ARÁ</span></td><td role="gridcell"><span style="font-size:7pt;" title="PDSDo">TA15A</span></td><td role="gridcell"><span style="font-size:7pt;" title="P125ão">MINIRAL</span></td><td role="gridcell"><span style="font-size:7pt;" title="A12o">-</span></td><td role="gridcell"><span style="font-size:7pt;" title="O4545ão">-</span></td><td role="gridcell"><span style="font-size:7pt;" title="A45So">- </span></td><td role="gridcell"><span style="font-size:7pt;" title="ASD1vo">-</span></td><td role="gridcell"><span style="font-size:7pt;" title="D45el">18/02/2021 04:35:30</span></td><td role="gridcell"><span style="font-size:7pt;" title="Idto">405833357</span></td></tr>]]>
</update>
<update id="j_id1:javax.faces.ViewState:0">
<![CDATA[-8530455S7417:3382887371AS10732]]>
</update>
<extension ln="primefaces" type="args">{"totalRecords":1}</extension>
</changes>
</partial-response>
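I need to parse the table row inside the CDATA. I tried passing it to lxml.html.fromstring(), but the resulting output ignores the CDATA content. Is there a way to get everything inside the CDATA using lxml or another Python library?
Use BeautifulSoup. CData is a subclass of NavigableString, so CDATA sections can be picked out of the text nodes by type.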
import bs4
data = """<?xml version='1.0' encoding='UTF-8'?>
<partial-response id="j_id1">
<changes>
<update id="loginForm:tabelaProcessos">
<![CDATA[<tr data-ri="5" class="ui-widget-content ui-datatable-odd" role="row"><td role="gridcell" style="word-break:break-all;"><span style="font-size:7pt;text-align: center;" title="XPT">08454.8100</span></td><td role="gridcell"><span style="font-size:7pt;" title="tDFvo">ARÁ</span></td><td role="gridcell"><span style="font-size:7pt;" title="PDSDo">TA15A</span></td><td role="gridcell"><span style="font-size:7pt;" title="P125ão">MINIRAL</span></td><td role="gridcell"><span style="font-size:7pt;" title="A12o">-</span></td><td role="gridcell"><span style="font-size:7pt;" title="O4545ão">-</span></td><td role="gridcell"><span style="font-size:7pt;" title="A45So">- </span></td><td role="gridcell"><span style="font-size:7pt;" title="ASD1vo">-</span></td><td role="gridcell"><span style="font-size:7pt;" title="D45el">18/02/2021 04:35:30</span></td><td role="gridcell"><span style="font-size:7pt;" title="Idto">405833357</span></td></tr>]]>
</update>
<update id="j_id1:javax.faces.ViewState:0">
<![CDATA[-8530455S7417:3382887371AS10732]]>
</update>
<extension ln="primefaces" type="args">{"totalRecords":1}</extension>
</changes>
</partial-response>"""
soup = bs4.BeautifulSoup(data, 'html.parser')
# Walk every text node; CDATA sections come back as bs4.CData strings.
for cd in soup.find_all(string=True):
    if isinstance(cd, bs4.CData):
        print('CData contents: %r' % cd)
Reference: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#comments-and-other-special-strings
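Once the CDATA text has been extracted, it still has to be parsed as HTML to reach the individual cells. As a rough sketch of that second step, here is a standard-library-only variant (swapping in xml.etree.ElementTree and html.parser instead of BeautifulSoup or lxml). The XML below is a shortened stand-in for the response in the question, and the names SpanCollector and cdata_of are hypothetical helpers introduced only for illustration.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Shortened, hypothetical stand-in for the response shown in the question.
XML = """<partial-response id="j_id1">
<changes>
<update id="loginForm:tabelaProcessos">
<![CDATA[<tr data-ri="5" role="row"><td role="gridcell"><span title="XPT">08454.8100</span></td><td role="gridcell"><span title="Idto">405833357</span></td></tr>]]>
</update>
<update id="j_id1:javax.faces.ViewState:0">
<![CDATA[-8530455S7417:3382887371AS10732]]>
</update>
</changes>
</partial-response>"""


class SpanCollector(HTMLParser):
    """Collect (title, text) pairs for every <span> in an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._title = None  # title of the <span> currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            self._title = dict(attrs).get("title", "")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside a <span>.
        if self._title is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._title is not None:
            self.cells.append((self._title, "".join(self._text).strip()))
            self._title = None


def cdata_of(xml_text, update_id):
    """Return the text (CDATA included) of the <update> with the given id."""
    root = ET.fromstring(xml_text)
    for update in root.iter("update"):
        if update.get("id") == update_id:
            # The XML parser exposes CDATA as ordinary element text.
            return (update.text or "").strip()
    return None


fragment = cdata_of(XML, "loginForm:tabelaProcessos")
collector = SpanCollector()
collector.feed(fragment)
collector.close()
print(collector.cells)
```

The point of the two-step design is that an XML parser treats CDATA as plain text, so the markup inside it only becomes a tree after a second, HTML-aware parse of that text. The same extracted string could presumably be handed to BeautifulSoup instead of the hand-written collector.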