Web Scraping with Python: beautiful soup: bs4: <h1> Error 200 OK </h1>

Question

I am using Python 3 and I am simply trying to download the contents of a website, as follows:

# IMPORTS --------------------------------------------------------------------
import urllib.request
from bs4 import BeautifulSoup as bs

# CLASS DESC -----------------------------------------------------------------
class Parser:

    # CONSTRUCTOR
    def __init__(self, url):
        self.soup = bs(urllib.request.urlopen(url).read(), "lxml")

    # METHODS
    def getMetaData(self):

        print(self.soup.prettify()[0:1000])

# MAIN FUNCTION --------------------------------------------------------------
if __name__ == "__main__":

    webSite = Parser("http://www.donnamoderna.com")
    webSite.getMetaData()

For this I get the following output:

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html>
     <head>
        <title>
            200 OK
        </title>
    </head>
    <body>
        <h1>
            Error 200 OK
        </h1>
        <p>
            OK
        </p>
        <h3>
            Guru Meditation:
        </h3>
        <p>
            XID: 1815743332
        </p>
        <hr/>
        <p>
            Varnish cache server
        </p>
    </body>
</html>

I don't understand why this happens. It is not a proxy issue; I tried:

curl "http://www.donnamoderna.com"

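and it works fine. I also tried the code on other websites such as https://www.google.com, and it works there too. Is it because the site uses plain http rather than the secure https? Should I change something in my code? Thanks.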

Answer 1

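It turned out that the server was treating my request as coming from something other than a browser and therefore refused to serve the requested content. I solved it by using the requests library and changing the request headers to "confuse" the server (disguising my request as one coming from a browser), as follows: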

import requests
from bs4 import BeautifulSoup as bs

class Parser:

    # CONSTRUCTOR
    def __init__(self, url):

        # Necessary to make the server think that we are a browser.
        # Adjacent string literals are concatenated, so each piece
        # except the last ends with a space.
        headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) '
                                 'AppleWebKit/537.36 (KHTML, like Gecko) '
                                 'Chrome/41.0.2227.1 Safari/537.36'}

        # Make request
        r = requests.get(url, headers=headers)

        # Create soup object
        self.soup = bs(r.content, 'html.parser')
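
The same idea can probably be applied without switching libraries, since urllib.request also lets you set request headers. The sketch below is only an illustration under assumptions: the names BROWSER_UA, default_user_agent, build_request and fetch_html are hypothetical helpers, not part of the original code, and whether this particular server accepts the request depends on its configuration. It also shows the User-Agent that urllib sends by default, which is the value a server could use to recognise the request as not coming from a browser.

```python
import urllib.request

# Hypothetical constant: the browser User-Agent string used in the answer above.
BROWSER_UA = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/41.0.2227.1 Safari/537.36"
)


def default_user_agent():
    """Return the User-Agent urllib sends when none is set explicitly."""
    # A freshly built opener carries a single default header pair.
    return dict(urllib.request.build_opener().addheaders)["User-agent"]


def build_request(url, user_agent=BROWSER_UA):
    """Build a Request object that carries a browser-like User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_html(url, timeout=10):
    """Download the page as bytes using the browser-like request."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
        return response.read()
```

With these helpers, the constructor from the question could build its soup with bs(fetch_html(url), "lxml") instead of calling urlopen(url) directly, and printing default_user_agent() shows the "Python-urllib/..." value that is sent otherwise.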

python

parsing

curl

screen-scraping

bs4