Scrapy doesn't follow new requests
I wrote this code:
import scrapy
import logging

curl_command = "curl blah blah"


class MySpider(scrapy.Spider):
    name = 'myspider'
    allowed_domains = ['some_domain', ]
    start_urls = ['someurl', ]
    postal_codes = ['some_postal_code', ]

    def start_requests(self):
        for postal_code in self.postal_codes:
            curl_req = scrapy.Request.from_curl(curl_command=curl_command)
            curl_req._cb_kwargs = {'page': 0}
            yield curl_req

    def parse(self, response, **kwargs):
        cur_page = kwargs.get('page', 1)

        logging.info("Doing some logic")
        num_pages = do_some_logic()
        yield mySpiderItem

        if cur_page < num_pages:
            logging.info("New Request")
            curl_req = scrapy.Request.from_curl(curl_command=curl_command)
            curl_req._cb_kwargs = {'page': cur_page + 1}
            yield curl_req

        yield scrapy.Request(url="https://jsonplaceholder.typicode.com/posts")
The problem is that the parse method only gets called once. In other words, the log looks like this:
Doing some logic
New Request
Spider closing
I don't understand what happens to the new request. Logically, the new request should also produce a Doing some logic log entry, but for some reason it doesn't.
Am I missing something here? Is there another way to generate new requests?
I think you forgot the callback part of the request. Check this code, which I took from the docs. In your case it should be callback=self.parse:
class MySpider(scrapy.Spider):
    name = 'myspider'

    def start_requests(self):
        return [scrapy.FormRequest("http://www.example.com/login",
                                   formdata={'user': 'john', 'pass': 'secret'},
                                   callback=self.logged_in)]

    def logged_in(self, response):
        # here you would extract links to follow and return Requests for
        # each of them, with another callback
        pass
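If you stick with from_curl: according to the Scrapy docs, any extra keyword arguments are forwarded to the Request constructor, so callback and cb_kwargs can be passed directly instead of poking the private _cb_kwargs attribute. A rough sketch of that pattern (the curl command and URL here are hypothetical stand-ins for the question's "curl blah blah"):

import scrapy
import logging

# hypothetical stand-in for the question's "curl blah blah"
curl_command = 'curl "https://example.com/search"'


class MySpider(scrapy.Spider):
    name = 'myspider'

    def start_requests(self):
        # from_curl forwards extra keyword arguments to Request,
        # so callback and cb_kwargs can be set here directly.
        yield scrapy.Request.from_curl(
            curl_command=curl_command,
            callback=self.parse,
            cb_kwargs={'page': 0},
        )

    def parse(self, response, page=0):
        # 'page' arrives as a keyword argument because it was set in cb_kwargs
        logging.info("Doing some logic on page %d", page)
        # ... extract items here; follow-up requests can be built the same way,
        # passing callback=self.parse and cb_kwargs={'page': page + 1}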
It's hard to tell exactly where the problem is from the code sample, but my guess is that you are not using the page number in the request.
As an example, I adapted your code to a different site:
import scrapy
import logging

curl_command = 'curl "https://scrapingclub.com/exercise/list_basic/"'


class MySpider(scrapy.Spider):
    name = 'myspider'
    allowed_domains = ['scrapingclub.com']
    # start_urls = ['someurl', ]
    postal_codes = ['some_postal_code', ]

    def start_requests(self):
        for postal_code in self.postal_codes:
            curl_req = scrapy.Request.from_curl(curl_command=curl_command, dont_filter=True)
            curl_req._cb_kwargs = {'page': 1}
            yield curl_req

    def parse(self, response, **kwargs):
        cur_page = kwargs.get('page', 1)

        logging.info("Doing some logic")
        # num_pages = do_some_logic()
        # yield mySpiderItem
        num_pages = 4

        if cur_page < num_pages:
            logging.info("New Request")
            curl_req = scrapy.Request.from_curl(curl_command=f'{curl_command}?page={str(cur_page + 1)}', dont_filter=True)
            curl_req._cb_kwargs = {'page': cur_page + 1}
            yield curl_req

        yield scrapy.Request(url="https://jsonplaceholder.typicode.com/posts")
Output:
[scrapy.core.engine] DEBUG: Crawled (200) <GET https://scrapingclub.com/exercise/list_basic/> (referer: None)
[root] INFO: Doing some logic
[root] INFO: New Request
[scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'jsonplaceholder.typicode.com': <GET https://jsonplaceholder.typicode.com/posts>
[scrapy.core.engine] DEBUG: Crawled (200) <GET https://scrapingclub.com/exercise/list_basic/?page=2> (referer: https://scrapingclub.com/exercise/list_basic/)
[root] INFO: Doing some logic
[root] INFO: New Request
[scrapy.core.engine] DEBUG: Crawled (200) <GET https://scrapingclub.com/exercise/list_basic/?page=3> (referer: https://scrapingclub.com/exercise/list_basic/?page=2)
[root] INFO: Doing some logic
[root] INFO: New Request
[scrapy.core.engine] DEBUG: Crawled (200) <GET https://scrapingclub.com/exercise/list_basic/?page=4> (referer: https://scrapingclub.com/exercise/list_basic/?page=3)
Scrapy has a built-in duplicate filter that is enabled by default. If you don't want this behaviour, you can set dont_filter=True so that duplicate requests are not ignored.
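Applied to the spider in the question: both requests are built from the exact same curl_command, so the second request is an exact duplicate of the first and the default dupefilter will drop it. A minimal sketch of the fix for the parse method, keeping the question's placeholders (do_some_logic, mySpiderItem and curl_command are assumed to be defined elsewhere):

    def parse(self, response, **kwargs):
        cur_page = kwargs.get('page', 1)

        logging.info("Doing some logic")
        num_pages = do_some_logic()
        yield mySpiderItem

        if cur_page < num_pages:
            logging.info("New Request")
            # This request has the same URL, method, headers and body as the
            # first one, so without dont_filter=True the dupefilter drops it.
            next_req = scrapy.Request.from_curl(
                curl_command=curl_command,
                dont_filter=True,
                cb_kwargs={'page': cur_page + 1},
            )
            yield next_req

Alternatively, make the follow-up request actually differ from the first one (for example by appending the page number to the URL, as in the example above), and the dupefilter will let it through without dont_filter.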