Web scraping fails with HTTP 403 Forbidden page
I am a beginner at web scraping and need to scrape https://mirror-h.org/archive/page/1 with BeautifulSoup, but the request fails and lands on a 403 page. How can I fix this? Any help is much appreciated.
Here is my code:
import requests
from bs4 import BeautifulSoup
import pandas
url = "https://mirror-h.org/archive/page/1"
page = pandas.read_html(url)
headers = {
'user-agent:' 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36'
}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.content, 'html.parser')
print(soup)
The error I get is:
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden
The 403 is raised by pandas.read_html(url), the line before your requests call: pandas fetches the URL itself through urllib with a default User-Agent, which the site rejects (hence the urllib.error.HTTPError in your traceback). Send the request yourself with browser headers and hand the fetched HTML to pandas:

import requests
import pandas as pd
from bs4 import BeautifulSoup

# the headers must be a dict; in your original code the colon sits inside the
# string, so 'user-agent:' 'Mozilla/...' is one concatenated string, not a key/value pair
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101 Firefox/91.0'
}

def main(url):
    # include the headers in the request
    r = requests.get(url, headers=headers)
    soup = BeautifulSoup(r.text, 'lxml')
    # prints <Response [200]>
    print(r)
    # pass the already-fetched HTML to pandas instead of the bare URL, so the
    # request goes out with the browser headers above
    df = pd.read_html(r.text)
    # this still raises ValueError: No tables found -- the content is rendered
    # by JavaScript behind Cloudflare protection, so try Selenium instead
    print(df)

main('https://mirror-h.org/archive/page/1')
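
For completeness, here is a minimal sketch of the Selenium route suggested in the last comment. It assumes Selenium 4+ (which resolves a matching chromedriver on its own) and that the Cloudflare check lets a real browser through; the By.TAG_NAME 'table' locator is also an assumption about the page's markup, so swap in whatever element the archive page actually renders. If Cloudflare still blocks plain Selenium, a stealth driver such as undetected-chromedriver is a common fallback.

import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

url = 'https://mirror-h.org/archive/page/1'

driver = webdriver.Chrome()  # Selenium 4+ downloads a matching driver itself
try:
    driver.get(url)
    # wait for the JavaScript-rendered content; the <table> locator is an
    # assumption, not something the site guarantees
    WebDriverWait(driver, 20).until(
        EC.presence_of_element_located((By.TAG_NAME, 'table'))
    )
    html = driver.page_source
finally:
    driver.quit()

# the same HTML can also be parsed with BeautifulSoup, as in the question
soup = BeautifulSoup(html, 'lxml')
# if the data really arrives as an HTML <table>, pandas can parse it directly
df = pd.read_html(html)[0]
print(df.head())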