Difference between HtmlResponse and requests.Response
I'd like to extract all the links to the perfume brands listed on this website's webpage.
I want to write a Scrapy scraper that finds all the hrefs, the same way I did with BeautifulSoup:
soup = BeautifulSoup(requests.get('https://www.nosetime.com' + url,
                                  headers=headers).content, 'html.parser')
print(soup)
result = soup.find_all('a', {'class': 'imgborder'})
for r in result:
    brand_url = r.attrs['href']
Here, requests.get returns a Response object.
But this homemade approach crashes with a 403 error, so I want to build a Scrapy scraper instead, because this tutorial claims it can handle these errors.
import scrapy
from bs4 import BeautifulSoup

class NosetimeScraper(scrapy.Spider):
    name = "nosetime"
    urls = ['/pinpai/2-a.html']
    start_urls = ['https://www.nosetime.com' + url for url in urls]

    def parse(self, response):
        # proceed to other pages of the listings
        soup = BeautifulSoup(response.content, 'html.parser')
        results = soup.find_all('a', {'class': 'imgborder'})
        for r in results:
            brand_url = r.attrs['href']
            yield scrapy.Request(url=brand_url, callback=self.parse)
            # then do something with the scrapy.Request() response that has been yielded ...
But it returns:
    soup = BeautifulSoup(response.content, 'html.parser')
AttributeError: 'HtmlResponse' object has no attribute 'content'
So I guess there is a difference between HtmlResponse and requests.Response?
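That's because scrapy.http.HtmlResponse is a different class from requests.models.Response, so it has no content attribute. You can use response.body instead, which holds the raw bytes of the page.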
That said, it would be better to use Scrapy's own .css() selectors:
def parse(self, response):
    urls = response.css('a.imgborder::attr(href)').getall()
    for url in urls:
        yield scrapy.Request(url=url, callback=self.parse)