urlopen of urllib.request cannot open a page in Python 3.7

Question

I want to write a web crawler in Python that collects article titles from Medium.com. I am using Python 3.7 and importing urlopen from urllib.request, but it cannot open the site and fails with this error:

 "urllib.error.HTTPError: HTTP Error 403: Forbidden"

from bs4 import BeautifulSoup
from urllib.request import urlopen

webAdd = urlopen("https://medium.com/")  # the HTTPError is raised on this line
bsObj = BeautifulSoup(webAdd.read(), "html.parser")  # name the parser explicitly

Result: urllib.error.HTTPError: HTTP Error 403: Forbidden

I expected the script to read the site without raising any error.

This does not happen when I use the requests module instead:

import requests 
from bs4 import BeautifulSoup 
url = 'https://medium.com/' 
response = requests.get(url, timeout=5)

This time it works fine.

Why?

Answer 1

urllib is a very old and small module. For web scraping, the requests module is recommended. You can check out this answer for additional information.

Answer 2

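Many sites nowadays check the User-Agent header of incoming requests in an attempt to block bots. requests is the nicer module to use, but if you really want to use urllib, you can change the request headers to pretend to be Firefox or some other browser so that the request is not blocked. A quick example can be found here: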

https://whosebug.com/a/16187955

import urllib.request

user_agent = 'Mozilla/5.0 (platform; rv:geckoversion) Gecko/geckotrail Firefox/firefoxversion'

url = "http://example.com"
request = urllib.request.Request(url)
request.add_header('User-Agent', user_agent)
response = urllib.request.urlopen(request)

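You will also need to replace the placeholders in the user_agent string (platform, geckoversion and so on) with real values. Hope this helps.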
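To make the difference between the two libraries concrete, here is a minimal offline sketch of what urllib sends by default and how a custom header replaces it. It only inspects the request objects and does not contact any server. The Firefox-style string and the build_request helper are illustrative assumptions, not values taken from the thread; as far as I know, requests identifies itself as python-requests/<version> by default, which would explain why a site that rejects Python-urllib may still let it through.

```python
import urllib.request

# urlopen uses an opener whose default header list identifies the client
# as "Python-urllib/<major.minor>", which some sites reject outright.
default_headers = urllib.request.build_opener().addheaders
print(default_headers)

def build_request(url, user_agent):
    """Hypothetical helper: return a Request carrying a custom User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})

# Illustrative browser-like value (an assumption; use a current real one).
FIREFOX_UA = "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0"

req = build_request("https://example.com/", FIREFOX_UA)

# Request stores header names capitalized, hence "User-agent" here.
print(req.get_header("User-agent"))
```

Passing req to urllib.request.urlopen would then send the browser-like header instead of the default one. Whether a given site accepts it depends on that site's bot checks, so this is not guaranteed to avoid a 403.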

Answer 3

This worked for me:

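from urllib.request import urlopen

MY_URL = "https://example.com/"  # placeholder: substitute the page you want to fetch
html = urlopen(MY_URL)
contents = html.read()  # raw response body as bytes
print(contents)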

python

urllib