urllib.error.HTTPError: HTTP Error 403: Forbidden in my web scraping

Question

I'm trying to write a web scraping script that tells me whether a website is built with WordPress or not, but I get this error:

urllib.error.HTTPError: HTTP Error 403: Forbidden

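I don't understand why, because I'm using these headers, which are supposed to get the request through (according to other Stack Overflow answers):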

   headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:66.0) Gecko/20100101 Firefox/66.0", "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", "Accept-Language": "fr-fr,en;q=0.5", "Accept-Encoding": "gzip, deflate", "DNT": "1", "Connection": "close", "Upgrade-Insecure-Requests": "1"}

Here is my function:


import urllib.request

import requests


def check_web_wp(url):
    is_wordpress = False
    print(repr(url))
    headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:66.0) Gecko/20100101 Firefox/66.0", "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", "Accept-Language": "fr-fr,en;q=0.5", "Accept-Encoding": "gzip, deflate", "DNT": "1", "Connection": "close", "Upgrade-Insecure-Requests": "1"}

    response = requests.get(url, headers=headers)

    with urllib.request.urlopen(url) as response:
        texte = response.read()
        poste_string = str(texte)
        splitted = poste_string.split()
    
        for word in splitted:
            if ("wordpress" in word):
                is_wordpress = True
                break
            
    return is_wordpress


def main():
    url = "https://icalendrier.fr/"
    is_wp = check_web_wp(url)

Am I missing something? Or is the website just too "secure"?

Thanks for your answers.

Answer 1

(Posting my comment as an answer, as requested)

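Your line with urllib.request.urlopen(url) as response: (no headers) overwrites the response object created just before it by response = requests.get(url, headers=headers) (with headers). The urlopen call sends a second request without your headers, and that is the request that raises the 403.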
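To illustrate why the headers never reach the urllib call: urlopen(url) only receives the URL string, so the headers dict is ignored by it. If one wanted to stay with urllib, the headers would have to be attached through urllib.request.Request. The sketch below builds the request objects without sending anything; build_request is a hypothetical helper name and the shortened HEADERS dict is an assumption for brevity. Whether the site then accepts the request is not something this sketch verifies.

```python
import urllib.request

# Shortened header set for illustration; the question uses a longer dict.
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:66.0) "
                  "Gecko/20100101 Firefox/66.0"
}


def build_request(url, headers=None):
    """Hypothetical helper: wrap the URL in a Request carrying the headers."""
    return urllib.request.Request(url, headers=headers or {})


bare = build_request("https://icalendrier.fr/")
with_headers = build_request("https://icalendrier.fr/", HEADERS)

# urllib stores header names in capitalized form, hence "User-agent".
print(bare.has_header("User-agent"))          # False
print(with_headers.get_header("User-agent"))

# Passing `with_headers` (instead of the bare URL) to
# urllib.request.urlopen() would send the browser-like User-Agent.
```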

Use only requests instead of urllib, like this:

import requests


def check_web_wp_fixed(url):
    is_wordpress = False
    headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:66.0) Gecko/20100101 Firefox/66.0", "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", "Accept-Language": "fr-fr,en;q=0.5", "Accept-Encoding": "gzip, deflate", "DNT": "1", "Connection": "close", "Upgrade-Insecure-Requests": "1"}

    # Single request, sent with the browser-like headers
    response = requests.get(url, headers=headers)
    splitted = response.text.split()

    # Look for "wordpress" in any whitespace-separated token of the page
    for word in splitted:
        if "wordpress" in word:
            is_wordpress = True
            break

    return is_wordpress

(I only made it work; I have not checked whether the code could be optimized in any way.)
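As one possible simplification of the loop above, the word-by-word search can be replaced by a single substring test, since "wordpress" contains no whitespace and therefore always falls inside one token. This is only a sketch: contains_wordpress is a hypothetical helper name, and the case-insensitive comparison is an assumption that goes slightly beyond the original code.

```python
def contains_wordpress(html):
    """Hypothetical helper: True if the page text mentions "wordpress".

    Lower-casing the text is an assumption added here so that
    "WordPress" and "wordpress" are treated alike.
    """
    return "wordpress" in html.lower()


sample = '<meta name="generator" content="WordPress 5.2">'
print(contains_wordpress(sample))                      # True
print(contains_wordpress("<html><body>hi</body></html>"))  # False
```

With requests, it could be called as contains_wordpress(response.text), ideally after response.raise_for_status() so that a 403 surfaces as an exception instead of being scanned as if it were the page.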
