Trying to make a recursive crawl spider with Python. SyntaxError: non-keyword arg after keyword arg
I am trying to crawl more than one page. My function does return results for the first start URL, but I can't get the spider's rules to work.
Here is what I have so far:
import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from craigslist_sample.items import CraigslistSampleItem

class MySpider(CrawlSpider):
    name = "craigs"
    allowed_domains = ["craigslist.org"]
    start_urls = ["http://sfbay.craigslist.org/npo/"]

    rules = (
        Rule(SgmlLinkExtractor(allow=('.*?s=.*',), restrict_xpaths('a[@class="button next"]',)), callback='parse', follow=True),)
    def parse(self, response):
        for sel in response.xpath('//span[@class="pl"]'):
            item = CraigslistSampleItem()
            item['title'] = sel.xpath('a/text()').extract()
            item['link'] = sel.xpath('a/@href').extract()
            yield item
I get this error:
SyntaxError: non-keyword arg after keyword arg
Update:
Thanks to the answer below, there is no syntax error anymore, but my crawler just stays on the same page and never crawls further.
Updated code:
import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from craigslist_sample.items import CraigslistSampleItem
from scrapy.contrib.linkextractors import LinkExtractor

class MySpider(CrawlSpider):
    name = "craigs"
    allowed_domains = ["craigslist.org"]
    start_urls = ["http://sfbay.craigslist.org/npo/"]

    rules = (
        Rule(SgmlLinkExtractor(allow=['.*?s=.*'], restrict_xpaths=('a[@class="button next"]')),
             callback='parse', follow=True),
    )
    def parse(self, response):
        for sel in response.xpath('//span[@class="pl"]'):
            item = CraigslistSampleItem()
            item['title'] = sel.xpath('a/text()').extract()
            item['link'] = sel.xpath('a/@href').extract()
            yield item
Your problem is similar to this (Python 3):
>>> print("hello")
hello
>>> print("hello", end=",,")
hello,,
>>> print(end=",,", "hello")
SyntaxError: non-keyword arg after keyword arg
The line:

Rule(SgmlLinkExtractor(allow=('.*?s=.*',), restrict_xpaths('a[@class="button next"]',)), callback='parse', follow=True),)

must instead put the positional argument before the keyword argument:

Rule(SgmlLinkExtractor(restrict_xpaths('a[@class="button next"]',), allow=('.*?s=.*',)), callback='parse', follow=True),)

Note that restrict_xpaths is really meant to be passed as a keyword argument, restrict_xpaths=('a[@class="button next"]',). As originally written, Python parses restrict_xpaths(...) as a positional argument (a function call), and a positional argument after a keyword argument is exactly what the SyntaxError complains about.
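The rule can be checked with a plain function standing in for SgmlLinkExtractor (a hypothetical stand-in, not scrapy's actual class): keyword arguments may follow positional ones, never the reverse.

```python
# Hypothetical stand-in for SgmlLinkExtractor, just to show the argument rule.
def link_extractor(restrict_xpaths=None, allow=None):
    return {"restrict_xpaths": restrict_xpaths, "allow": allow}

# Fine: keyword arguments only, in any order.
ok = link_extractor(allow=('.*?s=.*',), restrict_xpaths=('a[@class="button next"]',))

# Also fine: positional argument first, then a keyword argument.
ok2 = link_extractor(('a[@class="button next"]',), allow=('.*?s=.*',))

# SyntaxError if uncommented: positional argument after a keyword argument.
# link_extractor(allow=('.*?s=.*',), ('a[@class="button next"]',))

print(ok == ok2)  # both calls bind the same arguments
```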
OK, so I found what the problem was with my use of the parse method:
def parse(self, response):
    for sel in response.xpath('//span[@class="pl"]'):
        item = CraigslistSampleItem()
        item['title'] = sel.xpath('a/text()').extract()
        item['link'] = sel.xpath('a/@href').extract()
        yield item
After reading this I found my problem:
http://doc.scrapy.org/en/latest/topics/spiders.html#scrapy.contrib.spiders.CrawlSpider
CrawlSpider uses parse as a method internally, so I had to rename my callback:
def parse_item(self, response):
    for sel in response.xpath('//span[@class="pl"]'):
        item = CraigslistSampleItem()
        item['title'] = sel.xpath('a/text()').extract()
        item['link'] = sel.xpath('a/@href').extract()
        yield item
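The pitfall can be sketched with plain classes (a toy stand-in, not scrapy's real CrawlSpider): overriding parse replaces the rule-following machinery, while a differently named callback leaves it intact.

```python
class CrawlSpiderLike:
    # Toy stand-in for scrapy's CrawlSpider: its built-in parse()
    # implements the rule-following crawl logic.
    def parse(self, page):
        return "follow rules on " + page

class BadSpider(CrawlSpiderLike):
    # Overriding parse() silently replaces the crawl logic,
    # so the spider stays on the same page and never follows its rules.
    def parse(self, page):
        return "extract items from " + page

class GoodSpider(CrawlSpiderLike):
    # A callback with a different name leaves CrawlSpider.parse intact.
    def parse_item(self, page):
        return "extract items from " + page

print(BadSpider().parse("page1"))   # crawl logic has been overridden away
print(GoodSpider().parse("page1"))  # crawl logic still runs
```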