Why can I not get local files to parse using BeautifulSoup4 in Jupyterlab

Question

I am following a web tutorial and trying to use BeautifulSoup4 in Jupyterlab to extract data from an html file (stored on my local PC), as follows:

from bs4 import BeautifulSoup

with open ('simple.html') as html_file:
    simple = BeautifulSoup('html_file','lxml')

print(simple.prettify())

No matter what the html file contains, I get the following output instead of the expected html:

<html>
 <body>
  <p>
   html_file
  </p>
 </body>
</html>

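I have also tried the html.parser parser, and I simply get html_file as the output. I know the code can find the file, because when I run it after removing the file from the directory I get a FileNotFoundError.

When I run python interactively from the same directory, it works fine. I am also able to run other BeautifulSoup code that parses web pages.

I am using Fedora 32 Linux with Python3, Jupyterlab, BeautifulSoup4, requests, and lxml, all installed in a virtual environment using pipenv.

Any help in pinpointing the problem is welcome.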

Answer 1

Your problem is in this line:

simple = BeautifulSoup('html_file','lxml')

In particular, you are telling BeautifulSoup to parse the literal string 'html_file' rather than the contents of the variable html_file.

Change it to:

simple = BeautifulSoup(html_file,'lxml')

(note the absence of quotes around html_file) and it should give the desired result.
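
To make the difference concrete, here is a minimal, self-contained sketch that runs both variants side by side. It is only an illustration: the sample file contents and the names sample_path, wrong and right are made up for this example, and it uses html.parser so that it does not depend on lxml being installed (the question reports the same behaviour with either parser).

```python
import tempfile
from pathlib import Path

from bs4 import BeautifulSoup

# Hypothetical sample file, created here only so the sketch is self-contained.
sample_path = Path(tempfile.mkdtemp()) / "simple.html"
sample_path.write_text("<html><body><h1>Hello</h1></body></html>", encoding="utf-8")

with open(sample_path, encoding="utf-8") as html_file:
    # Quoted: BeautifulSoup is handed the literal text 'html_file' as markup.
    wrong = BeautifulSoup("html_file", "html.parser")
    # Unquoted: BeautifulSoup is handed the open file object and parses its contents.
    right = BeautifulSoup(html_file, "html.parser")

print(wrong.get_text())     # text of the quoted variant
print(right.h1.get_text())  # heading text taken from the file
```

In the quoted variant the file is opened but never read, which would explain why a missing file still raises FileNotFoundError while the output never reflects the file's contents.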
