404 HTTP error, despite being able to see the page in the browser
I am trying to map this website, but I run into a problem when I try to crawl it fully: I get a 404 error even though the URL exists.
Here is my code:
import csv
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

csvFile = open("C:/Users/Pichau/codigo/govbr/brasil/govfederal/govbr/arquivos/teste.txt", 'wt')

paginas = set()
def getLinks(pageUrl):
    global paginas
    html = urlopen("https://www.gov.br/pt-br/" + pageUrl)
    bsObj = BeautifulSoup(html, "html.parser")
    writer = csv.writer(csvFile)
    for link in bsObj.findAll("a"):
        if 'href' in link.attrs:
            if link.attrs['href'] not in paginas:
                # new page found
                newPage = link.attrs['href']
                print(newPage)
                paginas.add(newPage)
                getLinks(newPage)
                csvRow = []
                csvRow.append(newPage)
                writer.writerow(csvRow)

getLinks("")
csvFile.close()
And here is the error message I get after I run the code:
#wrapper
/
#main-navigation
#nolivesearchGadget
#tile-busca-input
#portal-footer
http://brasil.gov.br
Traceback (most recent call last):
File "c:\Users\Pichau\codigo\govbr\brasil\govfederal\govbr\teste2.py", line 26, in <module>
getLinks("")
File "c:\Users\Pichau\codigo\govbr\brasil\govfederal\govbr\teste2.py", line 20, in getLinks
getLinks(newPage)
File "c:\Users\Pichau\codigo\govbr\brasil\govfederal\govbr\teste2.py", line 20, in getLinks
getLinks(newPage)
File "c:\Users\Pichau\codigo\govbr\brasil\govfederal\govbr\teste2.py", line 20, in getLinks
getLinks(newPage)
[Previous line repeated 4 more times]
File "c:\Users\Pichau\codigo\govbr\brasil\govfederal\govbr\teste2.py", line 10, in getLinks
html = urlopen("https://www.gov.br/pt-br/"+pageUrl)
File "C:\Users\Pichau\AppData\Local\Programs\Python\Python39\lib\urllib\request.py", line 214, in urlopen
return opener.open(url, data, timeout)
File "C:\Users\Pichau\AppData\Local\Programs\Python\Python39\lib\urllib\request.py", line 523, in open
response = meth(req, response)
File "C:\Users\Pichau\AppData\Local\Programs\Python\Python39\lib\urllib\request.py", line 632, in http_response
response = self.parent.error(
File "C:\Users\Pichau\AppData\Local\Programs\Python\Python39\lib\urllib\request.py", line 561, in error
return self._call_chain(*args)
File "C:\Users\Pichau\AppData\Local\Programs\Python\Python39\lib\urllib\request.py", line 494, in _call_chain
result = func(*args)
File "C:\Users\Pichau\AppData\Local\Programs\Python\Python39\lib\urllib\request.py", line 641, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 404: Not Found
PS C:\Users\Pichau\codigo\govbr>
I tried it with just the main link and it works fine, but as soon as I add the pageUrl variable to the URL, it gives me this error. How can I fix it?
As far as I can tell, you're right - the page is there... for those of us using a browser. My assumption is that what's happening is some basic anti-botting mechanism that bans uncommon User-Agents, or in other words, only lets browsers view the page. However, since the User-Agent is a header we control, we can manipulate it so that the server doesn't throw the 404 error.
I can't type up the code right now, but you need to pair this Stack Overflow answer describing how to change a header in urllib with some code that takes that answer and changes the "User-Agent" header to a value like Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36, which I've taken from here.
After changing the User-Agent header, you should be able to download the page successfully.
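A minimal sketch of what that combination looks like (the User-Agent string is the one quoted above; whether it satisfies this particular server is an assumption you'll need to verify):

```python
from urllib.request import Request, urlopen

# Browser-like User-Agent string, as quoted above.
BROWSER_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/90.0.4430.93 Safari/537.36"
)

def open_as_browser(url):
    # Wrap the URL in a Request object carrying the browser-like
    # User-Agent header, so basic anti-bot checks see a "browser".
    req = Request(url, headers={"User-Agent": BROWSER_UA})
    return urlopen(req)
```

In your code, you would then replace the plain `urlopen(...)` call with something like `html = open_as_browser("https://www.gov.br/pt-br/" + pageUrl)`.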