Scrapy Follow & Scrape Next Pages
I'm having an issue where my Scrapy spider crawls a website, scrapes only a single page, and then stops. I was under the impression that the rules member variable was responsible for following links, but I can't get it to follow any. I've been following the documentation here: http://doc.scrapy.org/en/latest/topics/spiders.html#crawlspider
What am I missing that is stopping my bot from crawling?
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from scrapy.selector import Selector

from Example.items import ExItem


class ExampleSpider(CrawlSpider):
    name = "example"
    allowed_domains = ["example.ac.uk"]
    start_urls = (
        'http://www.example.ac.uk',
    )

    rules = (
        Rule(LinkExtractor(allow=("",)), callback="parse_items", follow=True),
    )
Replace your rules with this:
rules = (
    Rule(
        LinkExtractor(allow=('course-finder',), restrict_xpaths=('//div[@class="pagination"]',)),
        callback='parse_items',
        follow=True,
    ),
)
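This restricts the extractor to links whose URLs contain 'course-finder' and that sit inside the pagination div, so the spider keeps following the next-page links instead of stopping after the start URL. For context, here is a minimal sketch of how that rule might fit into the full spider; the parse_items body and the item's title field are assumptions for illustration, and the imports use the scrapy.spiders / scrapy.linkextractors paths that replaced scrapy.contrib in Scrapy 1.0+:

# Minimal sketch, not the asker's actual spider.
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

from Example.items import ExItem


class ExampleSpider(CrawlSpider):
    name = "example"
    allowed_domains = ["example.ac.uk"]
    start_urls = ('http://www.example.ac.uk',)

    rules = (
        # Follow only 'course-finder' links found inside the pagination div,
        # and hand every followed page to parse_items.
        Rule(
            LinkExtractor(
                allow=('course-finder',),
                restrict_xpaths=('//div[@class="pagination"]',),
            ),
            callback='parse_items',
            follow=True,
        ),
    )

    def parse_items(self, response):
        # Hypothetical extraction logic; the real ExItem fields are unknown.
        item = ExItem()
        item['title'] = response.xpath('//h1/text()').get()
        yield item

Note that CrawlSpider's rules only work if you do not override the built-in parse method, which is why the callback is named parse_items here.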