Beautiful Soup special character parsing error
I am using Beautiful Soup and urllib2 to scrape content from the web.
This is the code I am using:
from bs4 import BeautifulSoup
import urllib2
html = urllib2.urlopen('http://plrplr.com/33717/mp3-player-guide/').read()
soup = BeautifulSoup(html, "lxml")
contents = soup.find('div', {'class': 'entry-content'})
print contents
But the result I get looks like this...
<div class="entry-content">
<p>MP3 player, also well known as digital audio player has become a staple of our gadget life. There are many brands of MP3 players on the market today. So, which MP3 player are the most suitable for you? That’s where this MP3 player guide comes in. <br/>
Basically, there are 3 types of MP3 player based on capacity: – <br/>
1. Hard drive MP3 player <br/>
– highest capacity <br/>
– largest in size <br/>
– heavy <br/>
– often labeled as an “Jukebox MP3 player� <br/>
– has moving parts <br/>
– example: Apple iPod video, Sony Network Walkman NW-HD5 <br/>
The special characters are not handled correctly.
How can I get the exact source markup, like this...
<div class="entry-content">
<p>MP3 player, also well known as digital audio player has become a staple of our gadget life. There are many brands of MP3 players on the market today. So, which MP3 player are the most suitable for you? That’s where this MP3 player guide comes in. </br><br />
Basically, there are 3 types of MP3 player based on capacity: – </br><br />
1. Hard drive MP3 player </br><br />
– highest capacity </br><br />
– largest in size </br><br />
– heavy </br><br />
– often labeled as an “Jukebox MP3 player” </br><br />
– has moving parts </br><br />
– example: Apple iPod video, Sony Network Walkman NW-HD5 </br><br />
I am running this code on a Windows 8 machine using Eclipse and PyDev.
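Perhaps what you are looking for is contents.prettify(formatter="html"), which outputs entity codes instead of non-ASCII letters?
I can't test it on my machine, but here is the documentation I used: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#output-formatters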
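As a rough illustration of that suggestion, here is a minimal sketch, not tested against the page in the question. The html string is a made-up stand-in for the downloaded page, "html.parser" is used instead of "lxml" only so that the snippet has no extra dependency, and it is written for Python 3 rather than the Python 2 of the question. It shows the html formatter on both decode() and prettify().

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page (an assumption, not the real
# site content). It contains the same kind of curly quotes as the question.
html = (
    '<div class="entry-content">'
    "<p>That\u2019s an \u201cJukebox MP3 player\u201d</p>"
    "</div>"
)

# "html.parser" ships with Python; the question uses "lxml" instead.
soup = BeautifulSoup(html, "html.parser")
contents = soup.find("div", {"class": "entry-content"})

# formatter="html" asks Beautiful Soup to write non-ASCII characters as
# HTML entities where it knows one, instead of emitting the raw characters.
entity_markup = contents.decode(formatter="html")
print(entity_markup)

# prettify() accepts the same formatter argument and also re-indents the markup.
pretty = contents.prettify(formatter="html")
print(pretty)
```

Because the entity form is plain ASCII, it should print the same way whatever encoding the console uses, which may be enough to sidestep the garbled quotes. Note that it does not reproduce the page's original bytes.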