BeautifulSoup returning 403 error for some sites

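I don't understand why I'm getting a 403 error for some of these sites.

If I visit the URLs manually, the pages load fine. There is no error message beyond the 403 response, so I don't know how to diagnose the problem.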

from bs4 import BeautifulSoup
import requests

test_sites = [
    'http://fashiontoast.com/',
    'http://becauseimaddicted.net/',
    'http://www.lefashion.com/',
    'http://www.seaofshoes.com/',
]

for site in test_sites:
    print(site)
    # get page source
    response = requests.get(site)
    print(response)
    #print(response.text)

Running the code above gives:

http://fashiontoast.com/
<Response [403]>
http://becauseimaddicted.net/
<Response [403]>
http://www.lefashion.com/
<Response [200]>
http://www.seaofshoes.com/
<Response [200]>

Can anyone help me understand the cause of the problem and how to fix it?

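Some sites reject GET requests whose User-Agent they don't recognize. By default, requests identifies itself as python-requests/<version>, which some servers block.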
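As a quick check, the short sketch below prints the User-Agent that requests would send when none is set explicitly. It makes no network calls; the variable names `default_ua` and `session` are only illustrative.

```python
import requests

# The User-Agent string requests uses when you don't set one yourself.
default_ua = requests.utils.default_user_agent()
print(default_ua)  # has the form python-requests/<version>

# A fresh Session carries the same value in its default headers.
session = requests.Session()
print(session.headers["User-Agent"])
```

If a site returns 403 to the script but loads in a browser, this header is a reasonable first thing to compare, although a server may also be filtering on other criteria.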

Visit the page in a browser (Chrome), right-click and choose 'Inspect', then copy the User-Agent header of the GET request (shown in the Network tab).

from bs4 import BeautifulSoup
import requests

test_sites = [
    'http://fashiontoast.com/',
    'http://becauseimaddicted.net/',
    'http://www.lefashion.com/',
    'http://www.seaofshoes.com/',
]

with requests.Session() as se:
    # Send browser-like headers; update() keeps the session's other defaults
    se.headers.update({
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36",
        "Accept-Encoding": "gzip, deflate",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
        "Accept-Language": "en"
    })

    # Make the requests while the session is still open
    for site in test_sites:
        print(site)
        # get page source
        response = se.get(site)
        print(response)
        #print(response.text)

Output:

http://fashiontoast.com/
<Response [200]>
http://becauseimaddicted.net/
<Response [200]>
http://www.lefashion.com/
<Response [200]>
http://www.seaofshoes.com/
<Response [200]>
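To confirm that a session really sends the browser-like header, one option is to build the request without sending it and inspect the result. This is a minimal sketch that makes no network calls; `BROWSER_UA`, `prepared` and `sent_ua` are illustrative names, and the User-Agent string is the one used above.

```python
import requests

# Illustrative constant: the browser User-Agent string used in the answer.
BROWSER_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36"
)

with requests.Session() as se:
    se.headers.update({"User-Agent": BROWSER_UA})

    # prepare_request() merges the session headers into the request
    # without sending anything over the network.
    prepared = se.prepare_request(
        requests.Request("GET", "http://fashiontoast.com/")
    )
    sent_ua = prepared.headers["User-Agent"]
    print(sent_ua)
```

After a real request, `response.request.headers` exposes the headers that were actually sent, which can help when a site still answers 403.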