Difference between HtmlResponse and requests.Response

Question

I want to extract all the perfume brand links from this website's webpage.

I want to build a Scrapy scraper that finds all the hrefs, just as I did with BeautifulSoup:

    soup = BeautifulSoup(requests.get('https://www.nosetime.com'+ url, 
                                      headers=headers).content, 'html.parser')
    print(soup)
    result = soup.find_all('a', {'class': 'imgborder'})
    for r in result:
        brand_url = r.attrs['href']

Here, requests.get returns a Response object.

But this homemade approach crashes with 403 errors, so I want to build a Scrapy scraper instead, since this tutorial claims it can handle those errors.

    import scrapy
    from bs4 import BeautifulSoup

    class NosetimeScraper(scrapy.Spider):
        name = "nosetime"
        urls = ['/pinpai/2-a.html']
        start_urls = ['https://www.nosetime.com' + url for url in urls]

        def parse(self, response):
            # proceed to other pages of the listings
            soup = BeautifulSoup(response.content, 'html.parser')
            results = soup.find_all('a', {'class': 'imgborder'})
            for r in results:
                brand_url = r.attrs['href']
                yield scrapy.Request(url=brand_url, callback=self.parse)

            # then do something with the scrapy.Request() response that has been yielded ...

But it returns:

    soup = BeautifulSoup(response.content, 'html.parser')
    AttributeError: 'HtmlResponse' object has no attribute 'content'

So I suppose there is a difference between HtmlResponse and requests.Response?

Answer 1

That is because scrapy.http.HtmlResponse is a different class from requests.models.Response. It has no content attribute; use response.body (the raw bytes) instead.

But it would be better to use Scrapy's own .css() selectors:

    def parse(self, response):
        urls = response.css('a.imgborder::attr(href)').getall()
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

