Python BS4 not allowed to access the web page

Question

First I used html_doc=requests.get(x) to read the page, but when I printed the soup, I got a 403 Forbidden error.

To get around this, I added a user agent and used this code: html_doc=requests.get(x, headers=header). This time, however, when I tried to print the soup, I got a 400 Bad Request error.

Can someone guide me and help find a solution to this problem?

Edit - code:

from bs4 import BeautifulSoup, NavigableString
from urllib import request
import requests
import lxml
from lxml import etree
from lxml import html
x='https://www.topstockresearch.com/INDIAN_STOCKS/COMPUTERS_SOFTWARE/Wipro_Ltd.html'
header = {'User Agent' : 'Mozilla/5.0 (compatible; MSIE 9.0; Windows Phone OS 7.5; Trident/5.0; IEMobile/9.0)'}
html_doc=requests.get(x, headers=header)  #With header
html_doc=requests.get(x) #Without Header
soup = BeautifulSoup(html_doc.text, 'lxml')
print(soup)

URL: x=https://www.topstockresearch.com/INDIAN_STOCKS/COMPUTERS_SOFTWARE/Wipro_Ltd.html

Thanks for reading!

EDIT2: Solved using this code:

import requests

session = requests.Session()
response = session.get('https://www.topstockresearch.com/INDIAN_STOCKS/COMPUTERS_SOFTWARE/Wipro_Ltd.html', headers={'User-Agent': 'Mozilla/5.0'})

print(response.text)

PS: I am just learning to code, and this is not for any work-related purpose. It is only a personal project related to the stock market.

Answer 1

You need to use User-Agent: instead of User Agent:. HTTP header names must not contain spaces, so the dictionary key has to be 'User-Agent' with a hyphen.
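
As a rough illustration of the point above, here is a minimal sketch that checks header names before sending a request. The names invalid_header_names and fetch are hypothetical helpers made up for this example, not part of requests, and the regular expression is my reading of the "token" rule for header field names in the HTTP specification. Whether the site accepts the short 'Mozilla/5.0' value is taken from the EDIT2 code in the question, not something verified here.

```python
import re

# Header field names must be an HTTP "token": letters, digits and a few
# punctuation characters, with no spaces.
_TOKEN = re.compile(r"[!#$%&'*+\-.^_`|~0-9A-Za-z]+")


def invalid_header_names(headers):
    """Return the header names that are not valid HTTP tokens (hypothetical helper)."""
    return [name for name in headers if not _TOKEN.fullmatch(name)]


bad_headers = {"User Agent": "Mozilla/5.0"}   # space in the name, as in the question
good_headers = {"User-Agent": "Mozilla/5.0"}  # hyphenated name, as in EDIT2


def fetch(url, headers):
    """Hypothetical helper: request a page only if every header name is valid."""
    problems = invalid_header_names(headers)
    if problems:
        raise ValueError(f"invalid header names: {problems}")
    import requests  # imported lazily so the check above runs without it
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()  # surface 4xx/5xx responses as exceptions
    return response.text
```

Calling fetch(x, good_headers) with the URL from the question should then return the page HTML, which can be passed to BeautifulSoup(html, 'lxml') as in the original code. Calling it with bad_headers raises ValueError before any request is sent.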


python

screen-scraping

beautifulsoup