How to insert an unescaped HTML fragment in Beautiful Soup 4

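I have to parse some nasty government-created HTML (http://www.spokanecounty.org/detentionservices/inmateroster/detail2.aspx?sysid=84060), and to ease my pain I would like to insert a few HTML fragments into the document so that related content is wrapped into more digestible chunks.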

However, BS4 escapes the HTML string fragment I am inserting (<div class="case">) and converts it to:

&lt;div class="case"&gt;

The relevant HTML I am parsing looks like this:

<div style='float:left; width:100%;border-top:solid 1px #666;height:5px;'>
    &nbsp;
</div>
<div style='width:45%; float:left;'>
    <h2 style='margin-top:0px;' rel=121018261>Case Number: 121018261</h2>
</div>
<div style='width:45%;float:right; text-align:right;'>
    <div>Added: 10/22/2012</div>
</div>
<div style='width:100%;clear:both;'>
    <b>Case Bond:</b> ,000,000.00 <b style='margin-left:10px;'>Set By:</b> Spokane County Superior Court
</div>
<table class='bookinfo 121018261' style='width:100%;'>
    <tr><td><b>Charge 1  <a href='http://apps.leg.wa.gov/rcw/default.aspx?cite=' target='_blank'>RCW: 9A.56.210</a>:</b> ROBBERY-2ND DEG <br /> <b>Report Number:</b> 120160423 <b style='margin-left:10px;'>Report Agency:</b> 002 - SPOKANE CITY</td></tr>
    <tr><td><b>Charge 2 <a href='http://apps.leg.wa.gov/rcw/default.aspx?cite=' target='_blank'>RCW: 9A.56.210</a>:</b> ROBBERY-2ND DEG<br /><b>Report Number:</b> 120160423<b style='margin-left:10px;'>Report Agency:</b> 002 - SPOKANE CITY</td></tr>
</table>

<div style='float:left; width:100%;border-top:solid 1px #666;height:5px;'>
    &nbsp;
</div>
<div style='width:45%; float:left;'>
    <h2 style='margin-top:0px;' rel=121037010>Case Number: 121037010</h2>
</div>
<div style='width:45%;float:right; text-align:right;'>
    <div>Added: 10/21/2012</div>
</div>
<div style='width:100%;clear:both;'>
    <b>Case Bond:</b> 0,000.00 <b style='margin-left:10px;'>Set By:</b> Spokane County Superior Court
</div>
<table class='bookinfo 121037010' style='width:100%;'>
    <tr><td><b>Charge 1 <a href='http://apps.leg.wa.gov/rcw/default.aspx?cite=9A.44.050' target='_blank'>RCW: 9A.44.050(1)(A)</a>:</b> RAPE-2ND(FORCIBLE) <br /> <b>Report Number:</b> 120345597 <b style='margin-left:10px;'>Report Agency:</b> 001 - SPOKANE COUNTY</td></tr>
</table>

The Python code looks like this:

case_top = soup.find_all(style=re.compile("border-top:solid 1px #666"))
for c in case_top:
    c.insert_before(soup.new_string('<div class="case">'))
case_bottom = soup.find_all("table", class_="bookinfo")
for c in case_bottom:
    c.insert_after(soup.new_string('</div>'))

The result looks like this:

&lt;div class="case"&gt;<div style="float:left; width:100%;border-top:solid 1px #666;height:5px;"> </div><div style="width:45%; float:left;"><h2 rel="121018261" style="margin-top:0px;">Case Number: 121018261</h2></div><div style="width:45%;float:right; text-align:right;"><div>Added: 10/22/2012</div></div><div style="width:100%;clear:both;"><b>Case Bond:</b> ,000,000.00 <b style="margin-left:10px;">Set By:</b> Spokane County Superior Court</div><table class="bookinfo 121018261" style="width:100%;"><tr><td><b>Charge 1 <a href="http://apps.leg.wa.gov/rcw/default.aspx?cite=" target="_blank">RCW: 9A.56.210</a>:</b> ROBBERY-2ND DEG<br/><b>Report Number:</b> 120160423 <b style="margin-left:10px;">Report Agency:</b> 002 - SPOKANE CITY</td></tr><tr><td><b>Charge 2 <a href="http://apps.leg.wa.gov/rcw/default.aspx?cite=" target="_blank">RCW: 9A.56.210</a>:</b> ROBBERY-2ND DEG<br/><b>Report Number:</b> 120160423 <b style="margin-left:10px;">Report Agency:</b> 002 - SPOKANE CITY</td></tr></table>&lt;/div&gt;&lt;div class="case"&gt;<div style="float:left; width:100%;border-top:solid 1px #666;height:5px;"> </div><div style="width:45%; float:left;"><h2 rel="121037010" style="margin-top:0px;">Case Number: 121037010</h2></div><div style="width:45%;float:right; text-align:right;"><div>Added: 10/21/2012</div></div><div style="width:100%;clear:both;"><b>Case Bond:</b> 0,000.00 <b style="margin-left:10px;">Set By:</b> Spokane County Superior Court</div><table class="bookinfo 121037010" style="width:100%;"><tr><td><b>Charge 1 <a href="http://apps.leg.wa.gov/rcw/default.aspx?cite=9A.44.050" target="_blank">RCW: 9A.44.050(1)(A)</a>:</b> RAPE-2ND(FORCIBLE)<br/><b>Report Number:</b> 120345597 <b style="margin-left:10px;">Report Agency:</b> 001 - SPOKANE COUNTY</td></tr></table>&lt;/div&gt;

So the question is: how do I insert an unescaped HTML fragment into the document?

You are telling BeautifulSoup to insert string data:

c.insert_before(soup.new_string('<div class="case">'))

Anything that is not safe as HTML string data gets escaped. You want to insert a Tag object instead:

c.insert_before(soup.new_tag('div', **{'class': 'case'}))

This creates a new, empty element next to c; it does not actually wrap anything.
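The difference between the two calls can be checked on a tiny document. The following is a minimal sketch, not part of the original answer; the sample markup and the choice of the built-in 'html.parser' are assumptions made only for illustration.

```python
from bs4 import BeautifulSoup

# Tiny stand-in document (assumption, not the real page).
soup = BeautifulSoup("<body><p>hello</p></body>", "html.parser")
p = soup.p

# Inserting a string: the markup is treated as text and escaped on output.
p.insert_before(soup.new_string('<div class="case">'))
escaped = str(soup)

# Inserting a Tag object: a real, empty element is added to the tree.
p.insert_before(soup.new_tag("div", **{"class": "case"}))
with_tag = str(soup)

print(escaped)
print(with_tag)
```

The string version shows up as text, while the Tag version is a sibling element of the paragraph with nothing inside it.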

If you want to wrap each individual element in it, you can use the wrap() method:

c.wrap(soup.new_tag('div', **{'class': 'case'}))

But this only works on one tag at a time.
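As a small illustration of that limitation, here is a sketch that wraps a single table; the sample markup and the 'html.parser' argument are assumptions. Only the table ends up inside the new div, and its sibling stays outside.

```python
from bs4 import BeautifulSoup

# Stand-in markup (assumption): one table followed by an unrelated sibling.
soup = BeautifulSoup(
    "<body><table class='bookinfo'></table><p>x</p></body>", "html.parser"
)
table = soup.find("table", class_="bookinfo")

# wrap() replaces the table with the new tag, puts the table inside it,
# and returns the wrapper.
wrapped = table.wrap(soup.new_tag("div", **{"class": "case"}))
result = str(soup.body)
print(result)
```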

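To wrap a series of tags, the only option is to move them; inserting a tag that already lives somewhere in the tree into another place effectively moves it there: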

case_top = soup.find_all(style=re.compile("border-top:solid 1px #666"))
for case in case_top:
    wrapper = soup.new_tag('div', **{'class': 'case'})
    case.insert_before(wrapper)
    while wrapper.next_sibling:
        wrapper.append(wrapper.next_sibling)
        if wrapper.find('table', class_='bookinfo'):
            # moved over the bookinfo table, time to stop
            break

This moves everything from the case_top element up to and including the <table class="bookinfo"> element into the new <div class="case"> element.
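If the same pattern is needed in more than one place, the loop can be pulled into a helper. The sketch below is one possible shape for it: wrap_run and is_bookinfo are hypothetical names, the sample markup is a shortened stand-in, and, unlike the loop above, the stop condition is tested on the node that was just moved rather than by searching the wrapper.

```python
import re

from bs4 import BeautifulSoup
from bs4.element import Tag


def wrap_run(soup, first, is_last, **attrs):
    """Move `first` and its following siblings into a new <div>,
    stopping after the first moved node for which is_last(node) is true.
    Hypothetical helper, not part of Beautiful Soup."""
    wrapper = soup.new_tag("div", **attrs)
    first.insert_before(wrapper)
    while wrapper.next_sibling is not None:
        node = wrapper.next_sibling
        wrapper.append(node)  # appending an attached node moves it
        if is_last(node):
            break
    return wrapper


def is_bookinfo(node):
    # Text nodes are siblings too, so check the type before the tag name.
    return (
        isinstance(node, Tag)
        and node.name == "table"
        and "bookinfo" in node.get("class", [])
    )


# Shortened stand-in for the real page (assumption).
html = """<body>
<div style="border-top:solid 1px #666;">&nbsp;</div>
<h2>Case Number: 1</h2>
<table class="bookinfo 1"></table>
<div style="border-top:solid 1px #666;">&nbsp;</div>
<h2>Case Number: 2</h2>
<table class="bookinfo 2"></table>
</body>"""
soup = BeautifulSoup(html, "html.parser")
for top in soup.find_all(style=re.compile("border-top:solid 1px #666")):
    wrap_run(soup, top, is_bookinfo, **{"class": "case"})

cases = soup.find_all("div", class_="case")
print(len(cases))
```

Testing the moved node directly avoids re-searching the wrapper on every iteration, at the cost of missing a table that is nested inside another sibling; which behaviour is preferable depends on the page.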

Demo:

>>> from bs4 import BeautifulSoup
>>> import re
>>> sample = '''\
... <body>
... <div style='float:left; width:100%;border-top:solid 1px #666;height:5px;'>
...     &nbsp;
... </div>
... <div style='width:45%; float:left;'>
...     <h2 style='margin-top:0px;' rel=121018261>Case Number: 121018261</h2>
... </div>
... <div style='width:45%;float:right; text-align:right;'>
...     <div>Added: 10/22/2012</div>
... </div>
... <div style='width:100%;clear:both;'>
...     <b>Case Bond:</b> ,000,000.00 <b style='margin-left:10px;'>Set By:</b> Spokane County Superior Court
... </div>
... <table class='bookinfo 121018261' style='width:100%;'>
...     <tr><td><b>Charge 1  <a href='http://apps.leg.wa.gov/rcw/default.aspx?cite=' target='_blank'>RCW: 9A.56.210</a>:</b> ROBBERY-2ND DEG <br /> <b>Report Number:</b> 120160423 <b style='margin-left:10px;'>Report Agency:</b> 002 - SPOKANE CITY</td></tr>
...     <tr><td><b>Charge 2 <a href='http://apps.leg.wa.gov/rcw/default.aspx?cite=' target='_blank'>RCW: 9A.56.210</a>:</b> ROBBERY-2ND DEG<br /><b>Report Number:</b> 120160423<b style='margin-left:10px;'>Report Agency:</b> 002 - SPOKANE CITY</td></tr>
... </table>
... 
... <div style='float:left; width:100%;border-top:solid 1px #666;height:5px;'>
...     &nbsp;
... </div>
... <div style='width:45%; float:left;'>
...     <h2 style='margin-top:0px;' rel=121037010>Case Number: 121037010</h2>
... </div>
... <div style='width:45%;float:right; text-align:right;'>
...     <div>Added: 10/21/2012</div>
... </div>
... <div style='width:100%;clear:both;'>
...     <b>Case Bond:</b> 0,000.00 <b style='margin-left:10px;'>Set By:</b> Spokane County Superior Court
... </div>
... <table class='bookinfo 121037010' style='width:100%;'>
...     <tr><td><b>Charge 1 <a href='http://apps.leg.wa.gov/rcw/default.aspx?cite=9A.44.050' target='_blank'>RCW: 9A.44.050(1)(A)</a>:</b> RAPE-2ND(FORCIBLE) <br /> <b>Report Number:</b> 120345597 <b style='margin-left:10px;'>Report Agency:</b> 001 - SPOKANE COUNTY</td></tr>
... </table>
... </body>
... '''
>>> soup = BeautifulSoup(sample, 'html.parser')
>>> case_top = soup.find_all(style=re.compile("border-top:solid 1px #666"))
>>> for case in case_top:
...     wrapper = soup.new_tag('div', **{'class': 'case'})
...     case.insert_before(wrapper)
...     while wrapper.next_sibling:
...         wrapper.append(wrapper.next_sibling)
...         if wrapper.find('table', class_='bookinfo'):
...             # moved over the bookinfo table, time to stop
...             break
... 
>>> soup.body
<body><div class="case"><div style="float:left; width:100%;border-top:solid 1px #666;height:5px;">
     
</div>
<div style="width:45%; float:left;">
<h2 rel="121018261" style="margin-top:0px;">Case Number: 121018261</h2>
</div>
<div style="width:45%;float:right; text-align:right;">
<div>Added: 10/22/2012</div>
</div>
<div style="width:100%;clear:both;">
<b>Case Bond:</b> ,000,000.00 <b style="margin-left:10px;">Set By:</b> Spokane County Superior Court
</div>
<table class="bookinfo 121018261" style="width:100%;">
<tr><td><b>Charge 1  <a href="http://apps.leg.wa.gov/rcw/default.aspx?cite=" target="_blank">RCW: 9A.56.210</a>:</b> ROBBERY-2ND DEG <br/> <b>Report Number:</b> 120160423 <b style="margin-left:10px;">Report Agency:</b> 002 - SPOKANE CITY</td></tr>
<tr><td><b>Charge 2 <a href="http://apps.leg.wa.gov/rcw/default.aspx?cite=" target="_blank">RCW: 9A.56.210</a>:</b> ROBBERY-2ND DEG<br/><b>Report Number:</b> 120160423<b style="margin-left:10px;">Report Agency:</b> 002 - SPOKANE CITY</td></tr>
</table></div>
<div class="case"><div style="float:left; width:100%;border-top:solid 1px #666;height:5px;">
     
</div>
<div style="width:45%; float:left;">
<h2 rel="121037010" style="margin-top:0px;">Case Number: 121037010</h2>
</div>
<div style="width:45%;float:right; text-align:right;">
<div>Added: 10/21/2012</div>
</div>
<div style="width:100%;clear:both;">
<b>Case Bond:</b> 0,000.00 <b style="margin-left:10px;">Set By:</b> Spokane County Superior Court
</div>
<table class="bookinfo 121037010" style="width:100%;">
<tr><td><b>Charge 1 <a href="http://apps.leg.wa.gov/rcw/default.aspx?cite=9A.44.050" target="_blank">RCW: 9A.44.050(1)(A)</a>:</b> RAPE-2ND(FORCIBLE) <br/> <b>Report Number:</b> 120345597 <b style="margin-left:10px;">Report Agency:</b> 001 - SPOKANE COUNTY</td></tr>
</table></div>
</body>
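With each case in its own <div class="case">, later lookups can be scoped to one block at a time, which was the point of wrapping. A minimal sketch, assuming already-wrapped markup that is much shorter than the real output; the summary dict and the way the case number is read from the heading text are illustrative choices only.

```python
from bs4 import BeautifulSoup

# Simplified, already-wrapped markup (assumption, trimmed from the demo output).
wrapped = """<body>
<div class="case"><h2>Case Number: 121018261</h2>
<table class="bookinfo 121018261"><tr><td>Charge 1</td></tr><tr><td>Charge 2</td></tr></table></div>
<div class="case"><h2>Case Number: 121037010</h2>
<table class="bookinfo 121037010"><tr><td>Charge 1</td></tr></table></div>
</body>"""
soup = BeautifulSoup(wrapped, "html.parser")

summary = {}
for case in soup.find_all("div", class_="case"):
    # Searches on `case` only see elements inside this wrapper.
    number = case.h2.get_text().split(":")[1].strip()
    charges = case.find("table", class_="bookinfo").find_all("tr")
    summary[number] = len(charges)

print(summary)
```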