How to fix Cyrillic characters while web-scraping with Python
I'm scraping a Cyrillic website with Python and BeautifulSoup, but I've run into a problem: every word comes out like this:
СилÑановÑка Ðавкова во Ðази
I also tried a few other Cyrillic websites, and they work fine.
My code looks like this:
from bs4 import BeautifulSoup
import requests
source = requests.get('https://').text
soup = BeautifulSoup(source, 'lxml')
print(soup.prettify())
How can I fix this?
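requests failed to detect the page as UTF-8, so it decoded the UTF-8 bytes with a different encoding (ISO-8859-1 here) and produced the garbled text. Set the encoding yourself before reading the text: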
from bs4 import BeautifulSoup
import requests
source = requests.get('https://time.mk/') # don't convert to text just yet
# print(source.encoding)
# prints out ISO-8859-1
source.encoding = 'utf-8' # override encoding manually
soup = BeautifulSoup(source.text, 'lxml') # this will now decode utf-8 correctly
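To see why the override works, here is a minimal offline sketch of the same mix-up. It assumes the page really is UTF-8 and was decoded as ISO-8859-1, as the commented-out print above suggests. The sample string and the helper name repair_mojibake are made up for illustration and are not part of requests or BeautifulSoup.

```python
# Hypothetical sample text standing in for the page content.
original = "Силјановска Давкова"

# What the server sends over the wire: UTF-8 bytes.
raw_bytes = original.encode("utf-8")

# Decoding those bytes with the wrong codec yields mojibake
# similar to the output shown in the question.
garbled = raw_bytes.decode("iso-8859-1")

# Decoding the same bytes as UTF-8 gives the text back.
fixed = raw_bytes.decode("utf-8")


def repair_mojibake(text, wrong="iso-8859-1", right="utf-8"):
    """Undo a wrong decode: re-encode with the wrong codec, then decode properly.

    Illustrative helper only; it assumes the text was produced by decoding
    `right`-encoded bytes with the `wrong` codec and nothing was lost since.
    """
    return text.encode(wrong).decode(right)


repaired = repair_mojibake(garbled)

print(garbled)
print(fixed)
print(repaired)
```

In the scraping code itself, the raw bytes are available as source.content, so the string round trip in the helper is only needed when the garbled text is all you have left.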