Unable to read wiki page with BeautifulSoup

Question

I am trying to read a wiki page using urllib and Beautiful Soup, as shown below.

Here is what I tried:

import urllib.parse as parse, urllib.request as request
from bs4 import BeautifulSoup

name = "メインページ"
root = 'https://ja.wikipedia.org/wiki/'
url = root + parse.quote_plus(name)

response = request.urlopen(url)
html = response.read()  # raw bytes, not yet decoded
print(html)

soup = BeautifulSoup(html.decode('UTF-8'), features="lxml")
print(soup)

The code runs without errors, but the Japanese characters cannot be read.
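One possible reason for this symptom is that response.read() returns bytes, and printing bytes shows escape sequences rather than characters. The sketch below illustrates the difference without any network access; the variable names raw and text are hypothetical, and the sys.stdout.reconfigure guard is an added precaution that assumes a terminal able to display UTF-8.

```python
import sys

# Assumption: the terminal can display UTF-8. This guard is illustrative
# and is not part of the code in the question.
if hasattr(sys.stdout, "reconfigure"):
    sys.stdout.reconfigure(encoding="utf-8", errors="replace")

name = "メインページ"

raw = name.encode("utf-8")   # stands in for the bytes returned by response.read()
text = raw.decode("utf-8")   # the str that would be passed to BeautifulSoup

print(raw)    # bytes are shown as escape sequences such as \xe3\x83\xa1
print(text)   # the decoded str is shown as Japanese characters
```

If the second print still looks wrong, the terminal or console encoding is a more likely culprit than the parsing step.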

Answer 1

Your approach looks correct and works for me. Try printing the parsed data with the following code and check the output.

soup = BeautifulSoup(html.decode('UTF-8'), features="lxml")
some_japanese = soup.find('div', {'id': 'mw-content-text'}).text.strip()
print(some_japanese)

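In my case, I get the following (this is part of the output, shown here in English translation):

William Butler Yeats (June 13, 1865 - January 28, 1939) was an Irish poet and playwright. He drew attention for lyric poems based on the Irish fairy tales he had known since childhood, and later became a leader of the Irish Literary Revival through the folk theatre movement. ...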

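If this doesn't work for you, try saving the HTML content to a file and opening it in a browser to check whether the Japanese text was fetched correctly.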
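The file check suggested above could look roughly like this. It is a minimal sketch: save_html is a hypothetical helper, the sample bytes stand in for the html value read from the response, and the file goes to a temporary directory chosen only for illustration.

```python
import tempfile
from pathlib import Path


def save_html(html_bytes, path):
    """Write the fetched bytes unchanged, so the file holds exactly what was downloaded."""
    target = Path(path)
    target.write_bytes(html_bytes)
    return target


# Stand-in for `html = response.read()` from the question.
sample = (
    '<html><head><meta charset="utf-8"></head>'
    "<body>メインページ</body></html>"
).encode("utf-8")

out_dir = Path(tempfile.mkdtemp())
saved = save_html(sample, out_dir / "page.html")

# Reading the file back as UTF-8 should reproduce the Japanese text.
round_trip = saved.read_text(encoding="utf-8")
print(saved)  # open this path in a browser to inspect the page
```

Writing in binary mode avoids any re-encoding on the way to disk, so if the browser shows the Japanese text correctly, the download itself is fine and the problem lies in how the output is displayed.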


urllib

beautifulsoup

character-encoding

python-3.x