How to insert an unescaped html fragment in Beautiful Soup 4
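I have to parse some nasty government-created html (http://www.spokanecounty.org/detentionservices/inmateroster/detail2.aspx?sysid=84060), and to ease my pain I would like to insert some html fragments into the document to wrap pieces of content into more digestible chunks.
However, BS4 escapes the html string fragment I want to insert (<div class="case">) and turns it into:
&lt;div class="case"&gt;
The relevant html I am parsing looks like this: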
<div style='float:left; width:100%;border-top:solid 1px #666;height:5px;'>
</div>
<div style='width:45%; float:left;'>
<h2 style='margin-top:0px;' rel=121018261>Case Number: 121018261</h2>
</div>
<div style='width:45%;float:right; text-align:right;'>
<div>Added: 10/22/2012</div>
</div>
<div style='width:100%;clear:both;'>
<b>Case Bond:</b> ,000,000.00 <b style='margin-left:10px;'>Set By:</b> Spokane County Superior Court
</div>
<table class='bookinfo 121018261' style='width:100%;'>
<tr><td><b>Charge 1 <a href='http://apps.leg.wa.gov/rcw/default.aspx?cite=' target='_blank'>RCW: 9A.56.210</a>:</b> ROBBERY-2ND DEG <br /> <b>Report Number:</b> 120160423 <b style='margin-left:10px;'>Report Agency:</b> 002 - SPOKANE CITY</td></tr>
<tr><td><b>Charge 2 <a href='http://apps.leg.wa.gov/rcw/default.aspx?cite=' target='_blank'>RCW: 9A.56.210</a>:</b> ROBBERY-2ND DEG<br /><b>Report Number:</b> 120160423<b style='margin-left:10px;'>Report Agency:</b> 002 - SPOKANE CITY</td></tr>
</table>
<div style='float:left; width:100%;border-top:solid 1px #666;height:5px;'>
</div>
<div style='width:45%; float:left;'>
<h2 style='margin-top:0px;' rel=121037010>Case Number: 121037010</h2>
</div>
<div style='width:45%;float:right; text-align:right;'>
<div>Added: 10/21/2012</div>
</div>
<div style='width:100%;clear:both;'>
<b>Case Bond:</b> 0,000.00 <b style='margin-left:10px;'>Set By:</b> Spokane County Superior Court
</div>
<table class='bookinfo 121037010' style='width:100%;'>
<tr><td><b>Charge 1 <a href='http://apps.leg.wa.gov/rcw/default.aspx?cite=9A.44.050' target='_blank'>RCW: 9A.44.050(1)(A)</a>:</b> RAPE-2ND(FORCIBLE) <br /> <b>Report Number:</b> 120345597 <b style='margin-left:10px;'>Report Agency:</b> 001 - SPOKANE COUNTY</td></tr>
</table>
The Python code looks like this:
case_top = soup.find_all(style=re.compile("border-top:solid 1px #666"))
for c in case_top:
    c.insert_before(soup.new_string('<div class="case">'))
case_bottom = soup.find_all("table", class_="bookinfo")
for c in case_bottom:
    c.insert_after(soup.new_string('</div>'))
The result looks like this:
&lt;div class="case"&gt;<div style="float:left; width:100%;border-top:solid 1px #666;height:5px;"> </div><div style="width:45%; float:left;"><h2 rel="121018261" style="margin-top:0px;">Case Number: 121018261</h2></div><div style="width:45%;float:right; text-align:right;"><div>Added: 10/22/2012</div></div><div style="width:100%;clear:both;"><b>Case Bond:</b> ,000,000.00 <b style="margin-left:10px;">Set By:</b> Spokane County Superior Court</div><table class="bookinfo 121018261" style="width:100%;"><tr><td><b>Charge 1 <a href="http://apps.leg.wa.gov/rcw/default.aspx?cite=" target="_blank">RCW: 9A.56.210</a>:</b> ROBBERY-2ND DEG<br/><b>Report Number:</b> 120160423 <b style="margin-left:10px;">Report Agency:</b> 002 - SPOKANE CITY</td></tr><tr><td><b>Charge 2 <a href="http://apps.leg.wa.gov/rcw/default.aspx?cite=" target="_blank">RCW: 9A.56.210</a>:</b> ROBBERY-2ND DEG<br/><b>Report Number:</b> 120160423 <b style="margin-left:10px;">Report Agency:</b> 002 - SPOKANE CITY</td></tr></table>&lt;/div&gt;&lt;div class="case"&gt;<div style="float:left; width:100%;border-top:solid 1px #666;height:5px;"> </div><div style="width:45%; float:left;"><h2 rel="121037010" style="margin-top:0px;">Case Number: 121037010</h2></div><div style="width:45%;float:right; text-align:right;"><div>Added: 10/21/2012</div></div><div style="width:100%;clear:both;"><b>Case Bond:</b> 0,000.00 <b style="margin-left:10px;">Set By:</b> Spokane County Superior Court</div><table class="bookinfo 121037010" style="width:100%;"><tr><td><b>Charge 1 <a href="http://apps.leg.wa.gov/rcw/default.aspx?cite=9A.44.050" target="_blank">RCW: 9A.44.050(1)(A)</a>:</b> RAPE-2ND(FORCIBLE)<br/><b>Report Number:</b> 120345597 <b style="margin-left:10px;">Report Agency:</b> 001 - SPOKANE COUNTY</td></tr></table>&lt;/div&gt;
So the question is: how do I insert unescaped html fragments into the document?
You are telling BeautifulSoup to insert string data:
c.insert_before(soup.new_string('<div class="case">'))
Anything in string data that is not safe for HTML is escaped on output. You want to insert a Tag object instead:
c.insert_before(soup.new_tag('div', **{'class': 'case'}))
This creates a new, empty element next to c; it does not actually wrap anything.
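To make the difference concrete, here is a minimal sketch that inserts the same markup once as a string and once as a Tag. The one-paragraph sample document and the variable names as_text and as_tag are made up for illustration, and the explicit 'html.parser' argument is my own choice rather than something from the question.

```python
from bs4 import BeautifulSoup

# Tiny stand-in document (hypothetical, not the real inmate roster page).
soup = BeautifulSoup('<body><p>case data</p></body>', 'html.parser')
p = soup.p

# A string node is plain text, so markup characters are escaped on output.
p.insert_before(soup.new_string('<div class="case">'))
as_text = str(soup)

# A Tag object is real markup; it goes in as an empty sibling element.
p.insert_before(soup.new_tag('div', **{'class': 'case'}))
as_tag = str(soup)

print(as_text)
print(as_tag)
```

The first print shows the escaped text, the second shows an additional empty <div class="case"></div> sitting in front of the paragraph.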
If you want to wrap each individual element in it, you can use the Element.wrap() method:
c.wrap(soup.new_tag('div', **{'class': 'case'}))
But this only works for one tag at a time.
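As a small illustration of that limit, the sketch below wraps a single table in a one-table sample document that I made up for this purpose; the trailing paragraph stays outside the wrapper.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature document: one table followed by a paragraph.
soup = BeautifulSoup(
    '<body><table class="bookinfo"></table><p>next</p></body>',
    'html.parser',
)
table = soup.find('table', class_='bookinfo')

# wrap() puts the new tag around this one element and returns the wrapper.
wrapper = table.wrap(soup.new_tag('div', **{'class': 'case'}))
result = str(soup)
print(result)
```

Only the table ends up inside the new div, which is why a run of sibling tags needs a different approach.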
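To wrap a series of tags, the only option is to move the tags; inserting a tag that already lives in one place into another place effectively moves it over: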
case_top = soup.find_all(style=re.compile("border-top:solid 1px #666"))
for case in case_top:
    wrapper = soup.new_tag('div', **{'class': 'case'})
    case.insert_before(wrapper)
    while wrapper.next_sibling:
        wrapper.append(wrapper.next_sibling)
        if wrapper.find('table', class_='bookinfo'):
            # moved over the bookinfo table, time to stop
            break
This moves everything from the case_top element up to and including the <table class="bookinfo"> element into the new <div class="case"> element.
Demo:
>>> from bs4 import BeautifulSoup
>>> import re
>>> sample = '''\
... <body>
... <div style='float:left; width:100%;border-top:solid 1px #666;height:5px;'>
...
... </div>
... <div style='width:45%; float:left;'>
... <h2 style='margin-top:0px;' rel=121018261>Case Number: 121018261</h2>
... </div>
... <div style='width:45%;float:right; text-align:right;'>
... <div>Added: 10/22/2012</div>
... </div>
... <div style='width:100%;clear:both;'>
... <b>Case Bond:</b> ,000,000.00 <b style='margin-left:10px;'>Set By:</b> Spokane County Superior Court
... </div>
... <table class='bookinfo 121018261' style='width:100%;'>
... <tr><td><b>Charge 1 <a href='http://apps.leg.wa.gov/rcw/default.aspx?cite=' target='_blank'>RCW: 9A.56.210</a>:</b> ROBBERY-2ND DEG <br /> <b>Report Number:</b> 120160423 <b style='margin-left:10px;'>Report Agency:</b> 002 - SPOKANE CITY</td></tr>
... <tr><td><b>Charge 2 <a href='http://apps.leg.wa.gov/rcw/default.aspx?cite=' target='_blank'>RCW: 9A.56.210</a>:</b> ROBBERY-2ND DEG<br /><b>Report Number:</b> 120160423<b style='margin-left:10px;'>Report Agency:</b> 002 - SPOKANE CITY</td></tr>
... </table>
...
... <div style='float:left; width:100%;border-top:solid 1px #666;height:5px;'>
...
... </div>
... <div style='width:45%; float:left;'>
... <h2 style='margin-top:0px;' rel=121037010>Case Number: 121037010</h2>
... </div>
... <div style='width:45%;float:right; text-align:right;'>
... <div>Added: 10/21/2012</div>
... </div>
... <div style='width:100%;clear:both;'>
... <b>Case Bond:</b> 0,000.00 <b style='margin-left:10px;'>Set By:</b> Spokane County Superior Court
... </div>
... <table class='bookinfo 121037010' style='width:100%;'>
... <tr><td><b>Charge 1 <a href='http://apps.leg.wa.gov/rcw/default.aspx?cite=9A.44.050' target='_blank'>RCW: 9A.44.050(1)(A)</a>:</b> RAPE-2ND(FORCIBLE) <br /> <b>Report Number:</b> 120345597 <b style='margin-left:10px;'>Report Agency:</b> 001 - SPOKANE COUNTY</td></tr>
... </table>
... </body>
... '''
>>> soup = BeautifulSoup(sample)
>>> case_top = soup.find_all(style=re.compile("border-top:solid 1px #666"))
>>> for case in case_top:
...     wrapper = soup.new_tag('div', **{'class': 'case'})
...     case.insert_before(wrapper)
...     while wrapper.next_sibling:
...         wrapper.append(wrapper.next_sibling)
...         if wrapper.find('table', class_='bookinfo'):
...             # moved over the bookinfo table, time to stop
...             break
...
>>> soup.body
<body><div class="case"><div style="float:left; width:100%;border-top:solid 1px #666;height:5px;">
</div>
<div style="width:45%; float:left;">
<h2 rel="121018261" style="margin-top:0px;">Case Number: 121018261</h2>
</div>
<div style="width:45%;float:right; text-align:right;">
<div>Added: 10/22/2012</div>
</div>
<div style="width:100%;clear:both;">
<b>Case Bond:</b> ,000,000.00 <b style="margin-left:10px;">Set By:</b> Spokane County Superior Court
</div>
<table class="bookinfo 121018261" style="width:100%;">
<tr><td><b>Charge 1 <a href="http://apps.leg.wa.gov/rcw/default.aspx?cite=" target="_blank">RCW: 9A.56.210</a>:</b> ROBBERY-2ND DEG <br/> <b>Report Number:</b> 120160423 <b style="margin-left:10px;">Report Agency:</b> 002 - SPOKANE CITY</td></tr>
<tr><td><b>Charge 2 <a href="http://apps.leg.wa.gov/rcw/default.aspx?cite=" target="_blank">RCW: 9A.56.210</a>:</b> ROBBERY-2ND DEG<br/><b>Report Number:</b> 120160423<b style="margin-left:10px;">Report Agency:</b> 002 - SPOKANE CITY</td></tr>
</table></div>
<div class="case"><div style="float:left; width:100%;border-top:solid 1px #666;height:5px;">
</div>
<div style="width:45%; float:left;">
<h2 rel="121037010" style="margin-top:0px;">Case Number: 121037010</h2>
</div>
<div style="width:45%;float:right; text-align:right;">
<div>Added: 10/21/2012</div>
</div>
<div style="width:100%;clear:both;">
<b>Case Bond:</b> 0,000.00 <b style="margin-left:10px;">Set By:</b> Spokane County Superior Court
</div>
<table class="bookinfo 121037010" style="width:100%;">
<tr><td><b>Charge 1 <a href="http://apps.leg.wa.gov/rcw/default.aspx?cite=9A.44.050" target="_blank">RCW: 9A.44.050(1)(A)</a>:</b> RAPE-2ND(FORCIBLE) <br/> <b>Report Number:</b> 120345597 <b style="margin-left:10px;">Report Agency:</b> 001 - SPOKANE COUNTY</td></tr>
</table></div>
</body>
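If what you have is a complete, balanced fragment as a string rather than a lone opening tag, another option is to parse the fragment into its own soup and insert the resulting Tag. This is only a sketch under that assumption: the sample document, the fragment text and the 'html.parser' choice are all made up for illustration. A lone opening tag such as <div class="case"> cannot be inserted this way, because the tree holds whole elements and the parser would close the tag for you.

```python
from bs4 import BeautifulSoup

# Hypothetical target document and fragment, for illustration only.
soup = BeautifulSoup('<body><p>case data</p></body>', 'html.parser')
fragment = BeautifulSoup('<div class="case"><b>note</b></div>', 'html.parser')

# Detach the parsed element from the fragment soup, then insert it.
new_div = fragment.div.extract()
soup.p.insert_before(new_div)

result = str(soup)
print(result)
```

Because the fragment went through the parser, it is inserted as real elements and comes out unescaped.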