Web scraping fails with HTTP 403 Forbidden page
I am a beginner at web scraping and need to scrape https://mirror-h.org/archive/page/1 with BeautifulSoup, but the request fails and lands on a 403 page. How can I fix this? Any help is much appreciated.
Here is my code:
import requests
from bs4 import BeautifulSoup
import pandas
url = "https://mirror-h.org/archive/page/1"
page = pandas.read_html(url)
headers = {
'user-agent:' 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36'
}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.content, 'html.parser')
print(soup)
The error I get is:
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden
The 403 is raised by pandas.read_html(url), the line before your requests call: pandas fetches the URL itself through urllib with a default User-Agent, which the site rejects (hence the urllib.error.HTTPError in your traceback). Send the request yourself with browser headers and hand the fetched HTML to pandas:

import requests
import pandas as pd
from bs4 import BeautifulSoup

# the headers must be a dict; in your original code the colon sits inside the
# string, so 'user-agent:' 'Mozilla/...' is one concatenated string, not a key/value pair
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101 Firefox/91.0'
}

def main(url):
    # include the headers in the request
    r = requests.get(url, headers=headers)
    soup = BeautifulSoup(r.text, 'lxml')
    # prints <Response [200]>
    print(r)
    # pass the already-fetched HTML to pandas instead of the bare URL, so the
    # request goes out with the browser headers above
    df = pd.read_html(r.text)
    # this still raises ValueError: No tables found -- the content is rendered
    # by JavaScript behind Cloudflare protection, so try Selenium instead
    print(df)

main('https://mirror-h.org/archive/page/1')
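
For completeness, here is a minimal sketch of the Selenium route suggested in the last comment. It assumes Selenium 4+ (which resolves a matching chromedriver on its own) and that the Cloudflare check lets a real browser through; the By.TAG_NAME 'table' locator is also an assumption about the page's markup, so swap in whatever element the archive page actually renders. If Cloudflare still blocks plain Selenium, a stealth driver such as undetected-chromedriver is a common fallback.

import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

url = 'https://mirror-h.org/archive/page/1'

driver = webdriver.Chrome()  # Selenium 4+ downloads a matching driver itself
try:
    driver.get(url)
    # wait for the JavaScript-rendered content; the <table> locator is an
    # assumption, not something the site guarantees
    WebDriverWait(driver, 20).until(
        EC.presence_of_element_located((By.TAG_NAME, 'table'))
    )
    html = driver.page_source
finally:
    driver.quit()

# the same HTML can also be parsed with BeautifulSoup, as in the question
soup = BeautifulSoup(html, 'lxml')
# if the data really arrives as an HTML <table>, pandas can parse it directly
df = pd.read_html(html)[0]
print(df.head())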