urllib.error.HTTPError: HTTP Error 403: Forbidden
I'm getting the error "urllib.error.HTTPError: HTTP Error 403: Forbidden" when scraping certain pages, and I understand that adding something like hdr = {'User-Agent': 'Mozilla/5.0'} to the headers is the way to fix this.
However, I can't get it to work when the URLs I want to scrape are in a separate source file. How/where can I add the User-Agent in the code below?
from bs4 import BeautifulSoup
import urllib.request as urllib2
import time

list_open = open("source-urls.txt")
read_list = list_open.read()
line_in_list = read_list.split("\n")

i = 0
for url in line_in_list:
    soup = BeautifulSoup(urllib2.urlopen(url).read(), 'html.parser')
    name = soup.find(attrs={'class': "name"})
    description = soup.find(attrs={'class': "description"})
    for text in description:
        print(name.get_text(), ';', description.get_text())
        # time.sleep(5)
    i += 1
You can achieve the same thing using requests:
import requests

hdrs = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36'}

for url in line_in_list:
    resp = requests.get(url, headers=hdrs)
    soup = BeautifulSoup(resp.content, 'html.parser')
    name = soup.find(attrs={'class': "name"})
    description = soup.find(attrs={'class': "description"})
    # print once per URL; the extra inner loop over description's
    # children just repeated the same line
    print(name.get_text(), ';', description.get_text())
    # time.sleep(5)
    i += 1
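If you'd rather stay with urllib instead of switching libraries, the same header can be passed through a urllib.request.Request object, which urlopen() accepts in place of a bare URL string. A minimal sketch (the URL below is a placeholder; in your loop it would be each line read from source-urls.txt):

```python
import urllib.request

hdrs = {'User-Agent': 'Mozilla/5.0'}
url = 'https://example.com'  # placeholder URL

# Wrap the URL together with the headers in a Request object...
req = urllib.request.Request(url, headers=hdrs)

# ...and hand it to urlopen() exactly where the plain URL went:
# html = urllib.request.urlopen(req).read()
```

That way the only change to your existing loop is building a Request per URL before calling urlopen().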
Hope this helps!