How to fix Cyrillic characters while web-scraping with Python

Question

I'm scraping a Cyrillic website with Python and BeautifulSoup, but I'm running into a problem: every word comes out like this:

Ð¡Ð¸Ð»ÑÐ°Ð½Ð¾Ð²ÑÐºÐ° ÐÐ°Ð²ÐºÐ¾Ð²Ð° Ð²Ð¾ ÐÐ°Ð·Ð¸

I also tried some other Cyrillic websites, and they work fine.

My code looks like this:

from bs4 import BeautifulSoup
import requests

source = requests.get('https://').text

soup = BeautifulSoup(source, 'lxml')

print(soup.prettify())

How can I fix this?

Answer 1

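requests failed to detect the page's encoding as utf-8. It reports ISO-8859-1 instead, so .text decodes the UTF-8 bytes with the wrong codec. Setting the encoding manually before reading .text fixes it.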
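To see why the wrong codec produces this kind of output, here is a minimal sketch using only the standard library. The sample string is an illustrative assumption, not text taken from the page in question.

```python
# Illustrative sample text (an assumption, not taken from the scraped page).
original = "Давкова"

# What the server sends: the text encoded as UTF-8 bytes.
raw_bytes = original.encode("utf-8")

# What you get when those bytes are decoded as ISO-8859-1 instead of UTF-8.
garbled = raw_bytes.decode("iso-8859-1")

# Undoing the wrong decode and decoding as UTF-8 recovers the text.
repaired = garbled.encode("iso-8859-1").decode("utf-8")

print(repaired == original)
```

Each Cyrillic letter takes two bytes in UTF-8, and ISO-8859-1 turns each byte into its own character, which is why every letter shows up as a pair starting with Ð or Ñ.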

from bs4 import BeautifulSoup
import requests

source = requests.get('https://time.mk/')  # don't convert to text just yet

# print(source.encoding)
# prints out ISO-8859-1

source.encoding = 'utf-8'  # override encoding manually

soup = BeautifulSoup(source.text, 'lxml')  # this will now decode utf-8 correctly
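A possible alternative, sketched under assumptions: hand BeautifulSoup the raw bytes rather than an already-decoded string, so the decoding happens in one place. With a live response that would mean passing source.content instead of source.text. The sample page below is a hypothetical stand-in so the snippet runs without a network request, and it uses the built-in html.parser rather than lxml.

```python
from bs4 import BeautifulSoup

# Stand-in for source.content: a small UTF-8 page (hypothetical sample, not the real site).
html_bytes = (
    '<html><head><meta charset="utf-8"></head>'
    "<body><p>Вести</p></body></html>"
).encode("utf-8")

# Given bytes, BeautifulSoup does the decoding itself; from_encoding pins the codec explicitly.
soup = BeautifulSoup(html_bytes, "html.parser", from_encoding="utf-8")

paragraph_text = soup.p.get_text()
print(paragraph_text)
```

Whether this is preferable to overriding source.encoding depends on the site. Both approaches avoid decoding the bytes as ISO-8859-1.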
