Any way to retrieve the response URL from the process_links function with Scrapy?
I'm trying to access the response (its URL, as a condition) from the process_links function so that I can rewrite the URLs. Is there any way to do this? Currently I get the error: process_links() takes exactly 3 arguments (2 given)
class Spider(CrawlSpider):
    name = 'spider_1'
    allowed_domains = ('domain.com',)
    start_urls = (
        'http://domain.com/new/1.html?content=image',
        'http://domain.com/new/1.html?content=video',
    )
    rules = [
        Rule(LinkExtractor(allow=(), restrict_xpaths=('//div[@class="pagination"]',)),
             callback='parse_page', process_links='process_links', follow=True)
    ]

    def process_links(self, links, resp):
        for link in links:
            if 'content=photo' in resp.url:
                link.url = "%s?content=photo" % link.url
            else:
                link.url = "%s?content=video" % link.url
        return links
Change

    def process_links(self, links, resp):

to

    def process_links(self, links):

You expect to receive the response in your function, but Scrapy only gives you the links.
Maybe something like this is what you want:
    rules = [
        Rule(LinkExtractor(allow=('content=photo',), restrict_xpaths=('//div[@class="pagination"]',)),
             callback='parse_page', process_links='process_photo_links', follow=True),
        Rule(LinkExtractor(allow=(), restrict_xpaths=('//div[@class="pagination"]',)),
             callback='parse_page', process_links='process_video_links', follow=True),
    ]

    def process_photo_links(self, links):
        for link in links:
            link.url = "%s?content=photo" % link.url
        return links

    def process_video_links(self, links):
        for link in links:
            link.url = "%s?content=video" % link.url
        return links
Update after comments:

Right, Scrapy doesn't pass the response to process_links. You can simply ignore the rules and generate the requests yourself:
    def parse_page(self, response):
        ...
        links = LinkExtractor(allow=(), restrict_xpaths=('//div[@class="pagination"]',)).extract_links(response)
        for link in links:
            if 'content=photo' in response.url:
                link.url = "%s?content=photo" % link.url
            else:
                link.url = "%s?content=video" % link.url
            yield scrapy.Request(link.url, callback=self.parse_page)
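One caveat with `"%s?content=photo" % link.url`: it silently produces a broken URL if the extracted link already carries a query string. A small stdlib sketch of a safer way to append the parameter (the helper name `add_query_param` is mine, not part of the answer):

```python
from urllib.parse import urlparse, parse_qsl, urlencode, urlunparse

def add_query_param(url, key, value):
    """Append (or overwrite) a single query parameter on a URL.

    Unlike "%s?content=..." % url, this stays correct when the URL
    already has a query string of its own.
    """
    parts = urlparse(url)
    query = dict(parse_qsl(parts.query))
    query[key] = value
    return urlunparse(parts._replace(query=urlencode(query)))

print(add_query_param("http://domain.com/new/2.html", "content", "photo"))
# http://domain.com/new/2.html?content=photo
print(add_query_param("http://domain.com/new/2.html?page=3", "content", "video"))
# http://domain.com/new/2.html?page=3&content=video
```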