Trying to make a recursive crawl spider with python. SyntaxError: non-keyword arg after keyword arg

I am trying to scrape more than one page. My function does return the first start URL, but I can't get the spider's rules to work.

Here is what I have so far:

import scrapy

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from craigslist_sample.items import CraigslistSampleItem



class MySpider(CrawlSpider):
    name = "craigs"
    allowed_domains = ["craigslist.org"]
    start_urls = ["http://sfbay.craigslist.org/npo/"]



    rules = (
        Rule(SgmlLinkExtractor(allow=('.*?s=.*',), restrict_xpaths('a[@class="button next"]',)), callback='parse', follow=True),)

    def parse(self, response):
        for sel in response.xpath('//span[@class="pl"]'):
            item = CraigslistSampleItem()
            item['title'] = sel.xpath('a/text()').extract()
            item['link'] = sel.xpath('a/@href').extract()
            yield item

I get this error:

SyntaxError: non-keyword arg after keyword arg

UPDATE:

Thanks to the answer below, the syntax error is gone, but my crawler just stays on the same page and does not crawl any further.

Updated code:

import scrapy

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from craigslist_sample.items import CraigslistSampleItem
from scrapy.contrib.linkextractors import LinkExtractor


class MySpider(CrawlSpider):
    name = "craigs"
    allowed_domains = ["craigslist.org"]
    start_urls = ["http://sfbay.craigslist.org/npo/"]

    rules = (
        Rule(SgmlLinkExtractor(allow=['.*?s=.*'], restrict_xpaths=('a[@class="button next"]')),
             callback='parse', follow=True),
    )


    def parse(self, response):
        for sel in response.xpath('//span[@class="pl"]'):
            item = CraigslistSampleItem()
            item['title'] = sel.xpath('a/text()').extract()
            item['link'] = sel.xpath('a/@href').extract()
            yield item

Your problem is similar to this one (Python 3):

>>> print("hello")
hello
>>> print("hello", end=",,")
hello,,
>>> print(end=",,", "hello")
SyntaxError: non-keyword arg after keyword arg

This line:

Rule(SgmlLinkExtractor(allow=('.*?s=.*',), restrict_xpaths('a[@class="button next"]',)), callback='parse', follow=True),)

fails because restrict_xpaths('a[@class="button next"]',) is missing an =, so Python reads it as a positional argument (a call to a name restrict_xpaths) that comes after the keyword argument allow=..., which is a syntax error. Either put positional arguments first or, as intended here, pass restrict_xpaths as a keyword argument:

Rule(SgmlLinkExtractor(allow=('.*?s=.*',), restrict_xpaths=('a[@class="button next"]',)), callback='parse', follow=True),)
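
For what it's worth, newer Scrapy releases deprecate the scrapy.contrib paths in favour of scrapy.spiders and scrapy.linkextractors. A minimal sketch of the same rule with the current API, assuming a recent Scrapy version (the leading // in the XPath is an assumption that the "next" button can appear anywhere in the page, and the callback is named parse_item because, as the update below explains, CrawlSpider reserves parse for itself):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

rules = (
    # every optional argument is passed as a keyword, so the ordering problem cannot occur
    Rule(LinkExtractor(allow=(r'.*?s=.*',), restrict_xpaths=('//a[@class="button next"]',)),
         callback='parse_item', follow=True),
)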

OK, so I found out what the problem with my parse method was:

def parse(self, response):
    for sel in response.xpath('//span[@class="pl"]'):
        item = CraigslistSampleItem()
        item['title'] = sel.xpath('a/text()').extract()
        item['link'] = sel.xpath('a/@href').extract()
        yield item 

After reading this I found my problem: http://doc.scrapy.org/en/latest/topics/spiders.html#scrapy.contrib.spiders.CrawlSpider

CrawlSpider already uses parse as a method of its own, so I had to rename my callback function (and change the rule's callback='parse' to callback='parse_item' to match):

def parse_item(self, response):
    for sel in response.xpath('//span[@class="pl"]'):
        item = CraigslistSampleItem()
        item['title'] = sel.xpath('a/text()').extract()
        item['link'] = sel.xpath('a/@href').extract()
        yield item
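
Putting both fixes together, a sketch of the full spider could look like this. It keeps the imports, URLs and selectors from the question (old-style scrapy.contrib paths); the only changes are that restrict_xpaths is passed as a keyword argument, the callback is renamed to parse_item and the rule points at it, and a leading // is added to the XPath on the assumption that the "next" link can sit anywhere in the page:

import scrapy

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from craigslist_sample.items import CraigslistSampleItem


class MySpider(CrawlSpider):
    name = "craigs"
    allowed_domains = ["craigslist.org"]
    start_urls = ["http://sfbay.craigslist.org/npo/"]

    rules = (
        # restrict_xpaths is a keyword argument, and the callback is NOT called 'parse'
        Rule(SgmlLinkExtractor(allow=(r'.*?s=.*',), restrict_xpaths=('//a[@class="button next"]',)),
             callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # extract title and link from every result row on the page
        for sel in response.xpath('//span[@class="pl"]'):
            item = CraigslistSampleItem()
            item['title'] = sel.xpath('a/text()').extract()
            item['link'] = sel.xpath('a/@href').extract()
            yield item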