Scrapy scrapes one page 'n' times but the other only once when in a loop
I am scraping two pages iteratively for each id. The first scraper works for all IDs, but the second works for only one ID.
import scrapy
from scrapy import Request

class MySpider(scrapy.Spider):
    name = "scraper"
    allowed_domains = ["example.com"]
    start_urls = ['http://example.com/viewData']

    def parse(self, response):
        ids = ['1', '2', '3']
        for id in ids:
            # The following request scrapes for all ids
            yield scrapy.FormRequest.from_response(response,
                ...
                callback=self.parse1)
            # The following request scrapes only for the 1st id
            yield Request(url="http://example.com/viewSomeOtherData",
                callback=self.intermediateMethod)

    def parse1(self, response):
        # Data scraped here using selectors
        pass

    def intermediateMethod(self, response):
        yield scrapy.FormRequest.from_response(response,
            ...
            callback=self.parse2)

    def parse2(self, response):
        # Some other data scraped here
        pass
I want to scrape two different pages for each ID.
Changing the following lines:
yield Request(url="http://example.com/viewSomeOtherData",
              callback=self.intermediateMethod)
to:
yield Request(url="http://example.com/viewSomeOtherData",
              callback=self.intermediateMethod,
              dont_filter=True)
worked for me.
Scrapy has a duplicate URL filter, which is probably filtering out your requests: you yield the same second URL once per id, so every copy after the first is dropped. Try adding dont_filter=True after the callback, as Steve suggested.
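To see why only the first request got through, here is a minimal pure-Python sketch of the idea behind Scrapy's duplicate filter (a toy illustration, not Scrapy's actual RFPDupeFilter implementation): requests are fingerprinted, repeated fingerprints are dropped, and dont_filter=True bypasses the check.

```python
import hashlib

class SimpleDupeFilter:
    """Toy stand-in for Scrapy's request dedup, for illustration only."""

    def __init__(self):
        self.seen = set()  # fingerprints of requests already scheduled

    def request_allowed(self, url, dont_filter=False):
        if dont_filter:
            # Mimics Request(..., dont_filter=True): skip dedup entirely.
            return True
        fp = hashlib.sha1(url.encode()).hexdigest()
        if fp in self.seen:
            return False  # duplicate: silently dropped
        self.seen.add(fp)
        return True

f = SimpleDupeFilter()
urls = ["http://example.com/viewSomeOtherData"] * 3  # same URL, once per id

# Without dont_filter only the first copy survives.
print([f.request_allowed(u) for u in urls])                    # [True, False, False]
# With dont_filter=True every copy goes through.
print([f.request_allowed(u, dont_filter=True) for u in urls])  # [True, True, True]
```

This mirrors the behavior in the question: the FormRequest differs per id (different form data gives a different fingerprint), while the plain Request to viewSomeOtherData is byte-identical each iteration, so it is filtered after the first pass.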