HTTP 403 Forbidden is shown while scraping data from a website using Python and bs4
A 403 Forbidden error appears when I try to scrape, so how can I go further and pull data from the website? Please guide me on how to scrape the data; I am a beginner at web scraping.
```python
import requests
from bs4 import BeautifulSoup
from csv import writer

url = "https://www.zocdoc.com/gastroenterologists/2"
page = requests.get(url)
page.status_code  # 403

soup = BeautifulSoup(page.content, 'html.parser')
print(soup.prettify())

users = soup.find_all('div', {'data-test': 'search-content'})
for use in users:
    doctor = use.find('span', attrs={'data-test': 'doctor-card-info-name-full'})#.replace('\n', '')
    specialty = use.find('div', {'data-test': 'doctor-card-info-specialty'})
    #specialty = use.find('div', class_="sc-192nc1l-0 buuGUI overflown")#.text.replace('\n', '')
    #price = use.find('div', class_="listing-search-item__price").text.replace('\n', '')
    #area = use.find('div', class_="listing-search-item__features").text.replace('\n', '')
    info = [doctor] #price, area]
    print(info)
```
I am not getting any output either; this website has no section tags I can use to get or find all the data.
In this case, you can overcome the 403 error as follows:
```python
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:96.0) Gecko/20100101 Firefox/96.0'}

HOST = 'https://www.zocdoc.com'
PAGE = 'gastroenterologists/2'

with requests.Session() as session:
    # hit the host first so the session picks up any cookies, then request the page
    (r := session.get(HOST, headers=headers)).raise_for_status()
    (r := session.get(f'{HOST}/{PAGE}', headers=headers)).raise_for_status()
    # process content from here
```
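Once the page downloads, the cards can be parsed with the `data-test` attributes from the question, and the `csv.writer` the question imports (but never uses) can save the rows. A minimal sketch against a hypothetical HTML snippet; the real page markup may differ, and parts of it may be rendered by JavaScript, in which case these selectors would find nothing:

```python
from csv import writer
from bs4 import BeautifulSoup

# Hypothetical snippet mimicking the structure the question selects on;
# the real zocdoc markup is not guaranteed to look like this.
html = """
<div data-test="search-content">
  <span data-test="doctor-card-info-name-full">Dr. Jane Doe</span>
  <div data-test="doctor-card-info-specialty">Gastroenterology</div>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')
rows = []
for card in soup.find_all('div', {'data-test': 'search-content'}):
    name = card.find('span', {'data-test': 'doctor-card-info-name-full'})
    specialty = card.find('div', {'data-test': 'doctor-card-info-specialty'})
    # guard against missing tags before calling .get_text()
    rows.append([name.get_text(strip=True) if name else '',
                 specialty.get_text(strip=True) if specialty else ''])

# write the collected rows with the csv.writer imported in the question
with open('doctors.csv', 'w', newline='') as f:
    w = writer(f)
    w.writerow(['doctor', 'specialty'])
    w.writerows(rows)

print(rows)
```

In the question's loop, `list.find(...)` refers to the built-in `list` type rather than the loop variable, which is one reason no output appeared; each card should be searched via the loop variable as above.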