Error in printing scraped webpage through bs4
Code:
import requests
import urllib
from bs4 import BeautifulSoup
page1 = urllib.request.urlopen("http://en.wikipedia.org/wiki/List_of_human_stampedes")
soup = BeautifulSoup(page1)
print(soup.get_text())
print(soup.prettify())
Error:
Traceback (most recent call last):
File "C:\Users\sony\Desktop\Trash\Crawler Try\try2.py", line 9, in <module>
print(soup.get_text())
File "C:\Python34\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u014d' in position 10487: character maps to <undefined>
I think the problem lies mainly with the urllib package. Here I am using the urllib3 package. The urlopen syntax changed from version 2 to version 3, which may be the cause of the error. That said, I have only included the latest syntax.
Python version 3.4
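For reference, the traceback shows the failure inside print(): the text is being encoded with the Windows console codec (cp1252), which cannot represent '\u014d'. A minimal sketch that keeps urllib.request but decodes the response bytes explicitly before parsing, assuming the page is served as UTF-8:
import urllib.request
from bs4 import BeautifulSoup

# Fetch the raw bytes and decode them ourselves so BeautifulSoup
# does not have to guess the document encoding.
raw = urllib.request.urlopen("http://en.wikipedia.org/wiki/List_of_human_stampedes").read()
soup = BeautifulSoup(raw.decode("utf-8"))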
Since you are already importing requests, you can use it instead of urllib, like this:
import requests
from bs4 import BeautifulSoup
page1 = requests.get("http://en.wikipedia.org/wiki/List_of_human_stampedes")
soup = BeautifulSoup(page1.text)
print(soup.get_text())
print(soup.prettify())
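As a follow-up note, requests decodes .text using the encoding it detects from the response headers; if that detection is ever wrong, you can inspect or override it before parsing. A small sketch, assuming the page is UTF-8:
page1 = requests.get("http://en.wikipedia.org/wiki/List_of_human_stampedes")
print(page1.encoding)        # the encoding requests detected from the headers
page1.encoding = "utf-8"     # override it only if the detection is wrong
soup = BeautifulSoup(page1.text)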
Your problem is that Python cannot encode the characters from the page you are scraping. For more information, see here:
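Concretely, the failing call is print() itself: the Windows console here uses cp1252, which has no mapping for characters like '\u014d'. One hedged workaround is to encode with a replacement policy before printing, for example:
text = soup.get_text()
# Substitute a placeholder for anything the console codec cannot represent instead of raising.
print(text.encode("cp1252", errors="replace").decode("cp1252"))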
Since the Wikipedia page is UTF-8, it appears that BeautifulSoup is guessing the encoding incorrectly. Try passing the from_encoding argument in your code, like this:
soup = BeautifulSoup(page1.text, from_encoding="UTF-8")
For more information about encodings in BeautifulSoup, look here: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#encodings
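One caveat: from_encoding only takes effect when BeautifulSoup is given raw bytes; page1.text has already been decoded by requests, so the argument would be ignored with a warning. A sketch of the bytes-based variant, assuming the requests code above:
# Pass the undecoded bytes (.content) so from_encoding actually applies.
soup = BeautifulSoup(page1.content, from_encoding="UTF-8")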
I am using Python 2.7, so I don't have the request method inside the urllib module.
#!/usr/bin/python3
# coding: utf-8
import requests
from bs4 import BeautifulSoup
URL = "http://en.wikipedia.org/wiki/List_of_human_stampedes"
soup = BeautifulSoup(requests.get(URL).text)
print(soup.get_text())
print(soup.prettify())
Put those print lines inside a try-except block, so that even if an illegal character comes along, it does not error out.
try:
    print(soup.get_text())
    print(soup.prettify())
except Exception:
    print(str(soup.get_text().encode("utf-8")))
    print(str(soup.prettify().encode("utf-8")))
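An alternative to catching the exception, if you would rather keep plain print() calls, is to re-wrap stdout so output is encoded as UTF-8 instead of the console codec. A minimal sketch for Python 3:
import io
import sys

# Replace the cp1252-backed stdout with a UTF-8 writer; errors="replace"
# is only a safety net for anything that still cannot be encoded.
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding="utf-8", errors="replace")

print(soup.get_text())
print(soup.prettify())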