尝试抓取 ASP 网页时出现 HTTPError 500

Question

我想抓取 http://quotes.toscrape.com/search.aspx 以获取所有报价。尽管保存了 ___VIEWSTATE 参数并传递了它，但我仍收到错误 500。

import requests
from bs4 import BeautifulSoup

url = 'http://quotes.toscrape.com/search.aspx'
#headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:75.0) Gecko/20100101 Firefox/75.0"}
s = requests.Session()
#s.headers.update(headers)
page = s.get(url)
page.raise_for_status()

soup = BeautifulSoup(page.content)
authors = soup.find('select', id="author")
url = 'http://quotes.toscrape.com/filter.aspx'
for author in authors.stripped_strings:
    if author != '----------':
        parameters = {'tag'    : '----------'}
        parameters.update({"___VIEWSTATE" : soup.select_one("#__VIEWSTATE")["value"]})
        parameters.update({"author" : author}) #autor.replace(" ", "+") ?
        print(parameters)                                                   
        page = s.post(url, data = parameters)
        page.raise_for_status() #ERROR 500
        soup = BeautifulSoup(page.content)
        tags = soup.find('select', id="tag")
        parameters.update({"submit_button" : "Search"})
        for tag in tags.stripped_strings:
            if tag != '----------':
                parameters.update({"___VIEWSTATE" : soup.select_one("#__VIEWSTATE")["value"]})
                parameters.update({"tag" : tag})
                page = s.post(url, data = parameters)
                page.raise_for_status()
                soup = BeautifulSoup(page.content)
                print(author + "-" + tag)
                print(soup.find("span", class_="content").get_text())

Answer 1

import requests
from bs4 import BeautifulSoup

first = "http://quotes.toscrape.com/search.aspx"
second = "http://quotes.toscrape.com/filter.aspx"

data = {
    'tag': '----------'
}


def parse(page):
    soup = BeautifulSoup(page, 'html.parser')
    return soup


def main(url):
    with requests.Session() as req:
        r = req.get(url)
        soup = parse(r.content)
        authors = [aut.get_text(strip=True)
                   for aut in soup.findAll("option", value=True)]
        data['__VIEWSTATE'] = soup.find("input", id="__VIEWSTATE")['value']
        for auth in authors:
            print(f"Extracting Author {auth}")
            data['author'] = auth
            r = req.post(second, data=data)
            soup = parse(r.content)
            tags = [list(tag.stripped_strings)[1:]
                    for tag in soup.findAll("select", id="tag")][0]
            data['submit_button'] = "Search"
            for tag in tags:
                data['tag'] = tag
                r = req.post(second, data=data)
                soup = parse(r.content)
                goal = soup.select_one("span.content").text
                print("Tag --> {:<20} = {}".format(tag, goal))


main(first)

输出样本：

Extracting Author Albert Einstein
Tag --> change               = “The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
Tag --> deep-thoughts        = “The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
Tag --> thinking             = “The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
Tag --> world                = “The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
Tag --> inspirational        = “There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”
Tag --> life                 = “There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”
Tag --> live                 = “There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”
Tag --> miracle              = “There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”
Tag --> miracles             = “There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”
Tag --> adulthood            = “Try not to become a man of success. Rather become a man of value.”
Tag --> success              = “Try not to become a man of success. Rather become a man of value.”
Tag --> value                = “Try not to become a man of success. Rather become a man of value.”
Tag --> simplicity           = “If you can't explain it to a six year old, you don't understand it yourself.”
Tag --> understand           = “If you can't explain it to a six year old, you don't understand it yourself.”
Tag --> children             = “If you want your children to be intelligent, read them fairy tales. If you want them to be more intelligent, read them more fairy tales.”
Tag --> fairy-tales          = “If you want your children to be intelligent, read them fairy tales. If you want them to be more intelligent, read them more fairy tales.”
Tag --> imagination          = “Logic will get you from A to Z; imagination will get you everywhere.”
Tag --> knowledge            = “Any fool can know. The point is to understand.”
Tag --> learning             = “Any fool can know. The point is to understand.”
Tag --> understanding        = “Any fool can know. The point is to understand.”
Tag --> wisdom               = “Any fool can know. The point is to understand.”
Tag --> simile               = “Life is like riding a bicycle. To keep your balance, you must keep moving.”
Tag --> music                = “If I were not a physicist, I would probably be a musician. I often think in music. I live my daydreams in music. I see my life in terms of music.”
Tag --> mistakes             = “Anyone who has never made a mistake has never tried anything new.”
Extracting Author J.K. Rowling
Tag --> abilities            = “It is our choices, Harry, that show what we truly are, far more than our abilities.”
Tag --> choices              = “It is our choices, Harry, that show what we truly are, far more than our abilities.”
Tag --> courage              = “It takes a great deal of bravery to stand up to our enemies, but just as much to stand up to our friends.”
Tag --> friends              = “It takes a great deal of bravery to stand up to our enemies, but just as much to stand up to our friends.”
Tag --> dumbledore           = “Of course it is happening inside your head, Harry, but why on earth should that mean that it is not real?”

Full Output: view online

Answer 2

只需在

中将“___VIEWSTATE”更改为“__VIEWSTATE”

parametros.update({"___VIEWSTATE" : soup.select_one("#__VIEWSTATE")["value"]})

尝试抓取 ASP 网页时出现 HTTPError 500

HTTPError 500 when trying to scrape an ASP webpage

python

viewstate

beautifulsoup

web-scraping

python-requests