urllib.error.HTTPError: HTTP Error 403: Forbidden in my web scraping
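I'm trying to write a web scraping script that tells me whether a website runs on WordPress, but I get this error:
urllib.error.HTTPError: HTTP Error 403: Forbidden
I don't understand why. I'm using these headers, which should let the request through (according to other Stack Overflow answers):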
headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:66.0) Gecko/20100101 Firefox/66.0", "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", "Accept-Language": "fr-fr,en;q=0.5", "Accept-Encoding": "gzip, deflate", "DNT": "1", "Connection": "close", "Upgrade-Insecure-Requests": "1"}
Here is my function:
import urllib.request

import requests


def check_web_wp(url):
    is_wordpress = False
    print(repr(url))
    headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:66.0) Gecko/20100101 Firefox/66.0", "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", "Accept-Language": "fr-fr,en;q=0.5", "Accept-Encoding": "gzip, deflate", "DNT": "1", "Connection": "close", "Upgrade-Insecure-Requests": "1"}
    response = requests.get(url, headers=headers)
    with urllib.request.urlopen(url) as response:
        texte = response.read()
        poste_string = str(texte)
        splitted = poste_string.split()
        for word in splitted:
            if ("wordpress" in word):
                is_wordpress = True
                break
    return is_wordpress


def main():
    url = "https://icalendrier.fr/"
    is_wp = check_web_wp(url)
Am I missing something? Is the website too "secure"?
Thanks for your answers.
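(As requested, posting my comment as an answer.)
Your line with urllib.request.urlopen(url) as response: (without headers) overwrites the response object returned just before by response = requests.get(url, headers=headers) (with headers). The 403 comes from that second request, which urllib sends with none of your headers.
Use only requests instead of urllib, like this: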
def check_web_wp_fixed(url):
    is_wordpress = False
    headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:66.0) Gecko/20100101 Firefox/66.0", "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", "Accept-Language": "fr-fr,en;q=0.5", "Accept-Encoding": "gzip, deflate", "DNT": "1", "Connection": "close", "Upgrade-Insecure-Requests": "1"}
    response = requests.get(url, headers=headers)
    splitted = response.text.split()
    for word in splitted:
        if ("wordpress" in word):
            is_wordpress = True
            break
    return is_wordpress
(I only made it work; I did not check whether the code could be optimized in any way.)
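For completeness, if you would rather stay with urllib, the headers have to be attached to a Request object, since urlopen(url) on a bare URL sends only urllib's default headers. The sketch below is one possible way to do that, not a tested fix for this particular site. The helper names build_request, contains_wordpress and check_web_wp_urllib are made up for illustration, the header set is trimmed from the one in the question, and the case-insensitive substring test is a swapped-in simplification of the word loop above.

```python
import urllib.request

# Trimmed version of the headers from the question. "Accept-Encoding" is left
# out on purpose: urllib does not decompress gzip responses automatically.
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:66.0) Gecko/20100101 Firefox/66.0",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "fr-fr,en;q=0.5",
}


def build_request(url):
    """Wrap the URL in a Request that carries the headers (hypothetical helper)."""
    return urllib.request.Request(url, headers=HEADERS)


def contains_wordpress(html):
    """Case-insensitive substring check; a rough heuristic, not a real detector."""
    return "wordpress" in html.lower()


def check_web_wp_urllib(url):
    """Fetch the page with urllib, sending the headers, then look for 'wordpress'."""
    with urllib.request.urlopen(build_request(url), timeout=10) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        html = response.read().decode(charset, errors="replace")
    return contains_wordpress(html)
```

Calling check_web_wp_urllib("https://icalendrier.fr/") would then send the browser-like headers with the urllib request. Whether the site still answers 403 depends on its own filtering rules, which this sketch cannot guarantee.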