How to extract HTML using BeautifulSoup?
The HTML source is:
html = """
<td>
<a href="/urlM5CLw" target="_blank">
<img alt="I" height="132" src="VZhAy" width="132"/>
</a>
<br/>
<cite title="mac-os-x-lion-icon-pack.en.softonic.com">
mac-os-x-lion-icon-pac...
</cite>
<br/>
<b>
Mac
</b>
OS X Lion Icon Pack's
<br/>
535 × 535 - 135k - png
</td>"""
My Python code:
soup = BeautifulSoup(html)
text = soup.find('td').renderContents()
With this code I get a string like:
<a href="/urlM5CLw" target="_blank"><img alt="I" height="132" src="VZhAy" width="132"/></a><br/><cite title="mac-os-x-lion-icon-pack.en.softonic.com">mac-os-x-lion-icon-pac...</cite><br/><b>Mac</b> OS X Lion Icon Pack's<br/>535 × 535 - 135k - png
But I don't want the <a>....</a> part; I only need:
<br/><cite title="mac-os-x-lion-icon-pack.en.softonic.com">mac-os-x-lion-icon-pac...</cite><br/><b>Mac</b> OS X Lion Icon Pack's<br/>535 × 535 - 135k - png
Try removing the <a> tag first, then get the content you want.
>>> soup.find('a').extract()
>>> text = soup.find('td').renderContents()
>>> text
'<br/><cite title="mac-os-x-lion-icon-pack.en.softonic.com">mac-os-x-lion-icon-pac...</cite><br/><b>Mac</b> OS X Lion Icon Pack's<br/>535 \xd7 535 - 135k - png'
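One detail that may be useful here: extract() returns the tag it removes, so the link can still be inspected afterwards. The following is a minimal sketch, assuming bs4 (BeautifulSoup 4) with the built-in "html.parser" backend and a shortened version of the markup from the question; the variable names removed, href and remaining are only illustrative.

```python
from bs4 import BeautifulSoup

# Shortened version of the <td> from the question (assumption: same structure).
html = (
    '<td><a href="/urlM5CLw" target="_blank">'
    '<img alt="I" src="VZhAy"/></a>'
    "<br/><b>Mac</b> OS X Lion Icon Pack's</td>"
)

soup = BeautifulSoup(html, "html.parser")

# extract() detaches the <a> tag from the tree and returns it.
removed = soup.find("a").extract()
href = removed["href"]

# decode_contents() renders the children of <td> as a str (not bytes).
remaining = soup.find("td").decode_contents()

print(href)
print(remaining)
```

If the removed link is never needed again, the return value can simply be ignored.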
You can use the Tag.decompose() method to remove the a tag and completely destroy its contents. You may also need to decode() your byte string and replace all occurrences of \n with ''.
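from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')  # requires the lxml package to be installed
soup.a.decompose()  # remove the <a> tag and destroy its contents
print(soup.td.renderContents().decode().replace('\n', ''))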
Output:
<br/><cite title="mac-os-x-lion-icon-pack.en.softonic.com"> mac-os-x-lion-icon-pac... </cite><br/><b> Mac </b> OS X Lion Icon Pack's <br/> 535 × 535 - 135k - png
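If the cell might contain more than one link, the same idea can be wrapped in a small helper. This is only a sketch under a few assumptions: it uses the "html.parser" backend instead of lxml so that no extra package is needed, it uses decode_contents() to get a str directly rather than decoding the bytes from renderContents(), and the function name td_without_links is hypothetical.

```python
from bs4 import BeautifulSoup

html = """<td>
<a href="/urlM5CLw" target="_blank">
<img alt="I" height="132" src="VZhAy" width="132"/>
</a>
<br/>
<cite title="mac-os-x-lion-icon-pack.en.softonic.com">
mac-os-x-lion-icon-pac...
</cite>
<br/>
<b>
Mac
</b>
OS X Lion Icon Pack's
</td>"""


def td_without_links(markup):
    """Return the inner HTML of the first <td> with every <a> removed."""
    soup = BeautifulSoup(markup, "html.parser")
    td = soup.find("td")
    # find_all() returns a list, so removing tags while looping is safe.
    for link in td.find_all("a"):
        link.decompose()
    # Render the remaining children as text and drop the newlines.
    return td.decode_contents().replace("\n", "")


result = td_without_links(html)
print(result)
```

Note that replace('\n', '') removes every newline, including those inside text nodes, so words that were separated only by a line break end up joined.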