HTTP Error 406 on Python Web scraper copy from Python All in One for Dummies
Good afternoon all,
I'm working through Python All In One for Dummies and have reached the web-scraping chapter. I'm trying to interact with the site they built specifically for this chapter, but I keep getting "HTTP Error 406" on all of my requests. The initial "open the page and get a response" step had the same problem until I pointed it at Google, so I've determined the problem is with that particular page.
Here is my code:
# get request module from URL lib
from urllib import request
# Get Beautiful Soup to help with the scraped data
from bs4 import BeautifulSoup
# sample page for practice
page_url = 'https://alansimpson.me/python/scrape_sample.html'
# open that page:
rawpage = request.urlopen(page_url)
#make a BS object from the html page
soup = BeautifulSoup(rawpage, 'html5lib')
# isolate the content block
content = soup.article
# create an empty list for dictionary items
links_list = []
#loop through all the links in the article
for link in content.find_all('a'):
    try:
        url = link.get('href')
        img = link.img.get('src')
        text = link.span.text
        links_list.append({'url':url, 'img':img, 'text':text})
    except AttributeError:
        pass
print(links_list)
Here is the output in the console:
(base) youngdad33@penguin:~/Python/AIO Python$ /usr/bin/python3 "/home/youngdad33/Python/AIO Python/webscrapper.py"
Traceback (most recent call last):
  File "/home/youngdad33/Python/AIO Python/webscrapper.py", line 10, in <module>
    rawpage = request.urlopen(page_url)
  File "/usr/lib/python3.7/urllib/request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python3.7/urllib/request.py", line 531, in open
    response = meth(req, response)
  File "/usr/lib/python3.7/urllib/request.py", line 641, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib/python3.7/urllib/request.py", line 569, in error
    return self._call_chain(*args)
  File "/usr/lib/python3.7/urllib/request.py", line 503, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.7/urllib/request.py", line 649, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 406: Not Acceptable
The most important line I gather is "HTTP Error 406: Not Acceptable" at the bottom; after some digging, I understand this means my request headers weren't accepted.
So how do I get this working? I'm using VS Code on a Chromebook, running Debian Linux with Anaconda 3.
Thanks!
You need to inject a user agent, as follows:
# get the third-party requests module
import requests
# Get Beautiful Soup to help with the scraped data
from bs4 import BeautifulSoup
# sample page for practice
page_url = 'https://alansimpson.me/python/scrape_sample.html'
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36'}
# open that page:
rawpage = requests.get(page_url,headers=headers)
#make a BS object from the html page
soup = BeautifulSoup(rawpage.content, 'html5lib')
# isolate the content block
content = soup.article
# create an empty list for dictionary items
links_list = []
#loop through all the links in the article
for link in content.find_all('a'):
    try:
        url = link.get('href')
        img = link.img.get('src')
        text = link.span.text
        links_list.append([{'url':url, 'img':img, 'text':text}])
    except AttributeError:
        pass
print(links_list)
Output
[[{'url': 'http://www.sixthresearcher.com/python-3-reference-cheat-sheet-for-beginners/', 'img': '../datascience/python/basics/basics256.jpg', 'text': 'Basics'}], [{'url':
'https://alansimpson.me/datascience/python/beginner/', 'img': '../datascience/python/beginner/beginner256.jpg', 'text': 'Beginner'}], [{'url': 'https://alansimpson.me/datascience/python/justbasics/', 'img': '../datascience/python/justbasics/justbasics256.jpg', 'text': 'Just the Basics'}], [{'url': 'https://alansimpson.me/datascience/python/cheatography/', 'img': '../datascience/python/cheatography/cheatography256.jpg', 'text': 'Cheatography'}], [{'url': 'https://alansimpson.me/datascience/python/dataquest/', 'img': '../datascience/python/dataquest/dataquest256.jpg', 'text': 'Dataquest'}], [{'url': 'https://alansimpson.me/datascience/python/essentials/', 'img': '../datascience/python/essentials/essentials256.jpg', 'text': 'Essentials'}], [{'url': 'https://alansimpson.me/datascience/python/memento/', 'img': '../datascience/python/memento/memento256.jpg', 'text': 'Memento'}], [{'url': 'https://alansimpson.me/datascience/python/syntax/', 'img': '../datascience/python/syntax/syntax256.jpg', 'text': 'Syntax'}], [{'url': 'https://alansimpson.me/datascience/python/classes/', 'img': '../datascience/python/classes/classes256.jpg', 'text': 'Classes'}], [{'url': 'https://alansimpson.me/datascience/python/dictionaries/', 'img': '../datascience/python/dictionaries/dictionaries256.jpg', 'text': 'Dictionaries'}], [{'url': 'https://alansimpson.me/datascience/python/functions/', 'img': '../datascience/python/functions/functions256.jpg', 'text': 'Functions'}], [{'url': 'https://alansimpson.me/datascience/python/ifwhile/', 'img': '../datascience/python/ifwhile/ifwhile256.jpg', 'text': 'If & While Loops'}], [{'url': 'https://alansimpson.me/datascience/python/lists/', 'img': '../datascience/python/lists/lists256.jpg', 'text': 'Lists'}]]
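If you'd rather stay with the standard library instead of switching to requests, the same fix works with urllib alone: wrap the URL in a `urllib.request.Request` object that carries a browser-like `User-Agent` header, then pass that object to `urlopen`. A minimal sketch (the exact User-Agent string is just an example; any realistic browser string should do):

```python
from urllib import request

# sample page for practice (same URL as above)
page_url = 'https://alansimpson.me/python/scrape_sample.html'

# Build a Request object carrying a browser-like User-Agent header,
# so the server does not reject us with 406 Not Acceptable.
req = request.Request(
    page_url,
    headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'},
)

# urlopen() accepts a Request object in place of a bare URL string:
# rawpage = request.urlopen(req)
# soup = BeautifulSoup(rawpage, 'html5lib')   # rest of the script is unchanged

# urllib normalizes stored header names, so look it up as 'User-agent':
print(req.get_header('User-agent'))
```

The rest of the original script (BeautifulSoup parsing, the `find_all('a')` loop) needs no changes; only the `urlopen` call does.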