What does read() in urlopen('http.....').read() do? [urllib]

Question

Hi, I am reading "Web Scraping with Python (2015)". I saw the following two ways of opening a URL, with and without .read(). See bs1 and bs2.

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('http://web.stanford.edu/~zlotnick/TextAsData/Web_Scraping_with_Beautiful_Soup.html')
bs1 = BeautifulSoup(html.read(), 'html.parser')

html = urlopen('http://web.stanford.edu/~zlotnick/TextAsData/Web_Scraping_with_Beautiful_Soup.html')
bs2 = BeautifulSoup(html, 'html.parser')

bs1 == bs2 # true


print(bs1.prettify()[0:100])
print(bs2.prettify()[0:100]) # prints same thing

So is .read() redundant? Thanks.

Code from p. 7 of Web Scraping with Python (uses .read()):

from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("http://www.pythonscraping.com/pages/page1.html")
bsObj = BeautifulSoup(html.read())

Code from p. 15 (without .read()):

from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("http://www.pythonscraping.com/pages/warandpeace.html")
bsObj = BeautifulSoup(html)

Answer 1

urllib.request.urlopen returns a file-like object; its read method returns the response body of that URL.

The BeautifulSoup constructor accepts either a string or an open filehandle, so yes, read() is redundant here.
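
To make the "file-like object" point concrete, here is a minimal sketch using only the standard library. The data: URL is my own stand-in for a real page (an assumption, so the snippet runs offline); with an http URL the response behaves the same way as far as read() is concerned.

```python
from urllib.parse import quote
from urllib.request import urlopen

# Assumption: a data: URL replaces a real web page so no network is needed.
url = "data:text/html," + quote("<html><body><p>Hello</p></body></html>")

response = urlopen(url)    # a file-like response object, not the HTML itself
first = response.read()    # bytes holding the whole response body
second = response.read()   # the stream was already consumed by the first call

print(type(first).__name__)  # bytes
print(first)                 # b'<html><body><p>Hello</p></body></html>'
print(second)                # b''
```

Because the response is a stream that can only be consumed once, the question's code has to call urlopen a second time before building bs2.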

Answer 2

Quoting the BS docs:

To parse a document, pass it into the BeautifulSoup constructor. You can pass in a string or an open filehandle:

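When you use the .read() method, you are using the "string" interface. When you don't, you are using the "filehandle" interface.

In practice it works the same way (although BS4 may read the file-like object lazily). In your case, the entire content is read into a string object, which may consume more memory than necessary.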
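
As a rough illustration of how one constructor can offer both interfaces, here is a small sketch. `to_markup` is a hypothetical helper of mine, not BeautifulSoup's actual code, and `io.BytesIO` stands in for the object returned by urlopen.

```python
import io


def to_markup(source):
    """Hypothetical helper: accept raw markup or a file-like object."""
    if hasattr(source, "read"):   # "filehandle" interface
        return source.read()      # pull the full contents out of the handle
    return source                 # "string" interface: already the markup


raw = b"<p>Hello</p>"
from_string = to_markup(raw)
from_handle = to_markup(io.BytesIO(raw))  # BytesIO stands in for a urlopen response

print(from_string == from_handle)  # True
```

In this sketch both paths end up holding the same bytes in memory, so any memory saving would depend on the parser actually consuming the handle incrementally.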

Answer 3

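Without the BeautifulSoup module

.read() is useful when you are not using the BeautifulSoup module, so in that case it is not redundant. You get the HTML content only if you call .read(); otherwise you only have the object returned by urlopen().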

With the BeautifulSoup module

The BeautifulSoup constructor handles both cases: it accepts either a string or the object returned by urlopen(some-site).
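
For the case without BeautifulSoup, a minimal sketch of getting the HTML as text might look like this. The data: URL and its charset parameter are assumptions of mine so the snippet runs offline; a real page would supply its own Content-Type header.

```python
from urllib.parse import quote
from urllib.request import urlopen

# Assumption: a data: URL with an explicit charset replaces a real page.
url = "data:text/html;charset=utf-8," + quote("<html><title>Demo</title></html>")

with urlopen(url) as response:
    # Use the charset declared in the headers, falling back to UTF-8.
    charset = response.headers.get_content_charset() or "utf-8"
    html_text = response.read().decode(charset)  # bytes -> str

print(html_text)  # <html><title>Demo</title></html>
```

Without the read() call you would only be holding the response object, and printing it shows its repr rather than the page content.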


python

urllib

beautifulsoup