Scrapy - TypeError: 'Rule' object is not iterable
I am trying to scrape titles from this site (https://minerals.usgs.gov/science/mineral-deposit-database/#products). I am using a CrawlSpider because I intend to pull more information from each URL on the page later on.
However, I get a TypeError: 'Rule' object is not iterable.
Here is the code I am using:
import scrapy
import datetime
import socket
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from usgs.items import MineralItem
from scrapy.loader import ItemLoader

class MineralSpider(CrawlSpider):
    name = 'mineral'
    allowed_domains = ['web']
    start_urls = 'https://minerals.usgs.gov/science/mineral-deposit-database/#products'

    rules = (
        Rule(LinkExtractor(
            restrict_xpaths='//*[@id="products"][1]/p/a'),
            callback='parse')
    )

    def parse(self, response):
        it = ItemLoader(item=MineralItem(), response=response)
        it.add_xpath('name', '//*[@class="container"]/header/h1/text()')
        it.add_value('url', response.url)
        it.add_value('project', self.settings.get('BOT_NAME'))
        it.add_value('spider', self.name)
        it.add_value('server', socket.gethostname())
        it.add_value('date', datetime.datetime.now())
        return it.load_item()
Log messages:
(base) C:\Users\User\Documents\Python WebCrawling Learing Projects\usgs\usgs\spiders>scrapy crawl mineral
2018-11-16 17:43:03 [scrapy.utils.log] INFO: Scrapy 1.5.1 started (bot: usgs)
2018-11-16 17:43:03 [scrapy.utils.log] INFO: Versions: lxml 4.2.5.0, libxml2 2.9.8, cssselect 1.0.3, parsel 1.4.0, w3lib 1.19.0, Twisted 18.7.0, Python 3.7.0 (default, Jun 28 2018, 08:04:48) [MSC v.1912 64 bit (AMD64)], pyOpenSSL 18.0.0 (OpenSSL 1.0.2p 14 Aug 2018), cryptography 2.3.1, Platform Windows-10-10.0.17134-SP0
2018-11-16 17:43:03 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'usgs', 'NEWSPIDER_MODULE': 'usgs.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['usgs.spiders']}
2018-11-16 17:43:03 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
Unhandled error in Deferred:
2018-11-16 17:43:03 [twisted] CRITICAL: Unhandled error in Deferred:
2018-11-16 17:43:03 [twisted] CRITICAL:
Traceback (most recent call last):
  File "C:\Users\User\Anaconda3\lib\site-packages\twisted\internet\defer.py", line 1418, in _inlineCallbacks
    result = g.send(result)
  File "C:\Users\User\Anaconda3\lib\site-packages\scrapy\crawler.py", line 79, in crawl
    self.spider = self._create_spider(*args, **kwargs)
  File "C:\Users\User\Anaconda3\lib\site-packages\scrapy\crawler.py", line 102, in _create_spider
    return self.spidercls.from_crawler(self, *args, **kwargs)
  File "C:\Users\User\Anaconda3\lib\site-packages\scrapy\spiders\crawl.py", line 100, in from_crawler
    spider = super(CrawlSpider, cls).from_crawler(crawler, *args, **kwargs)
  File "C:\Users\User\Anaconda3\lib\site-packages\scrapy\spiders\__init__.py", line 51, in from_crawler
    spider = cls(*args, **kwargs)
  File "C:\Users\User\Anaconda3\lib\site-packages\scrapy\spiders\crawl.py", line 40, in __init__
    self._compile_rules()
  File "C:\Users\User\Anaconda3\lib\site-packages\scrapy\spiders\crawl.py", line 92, in _compile_rules
    self._rules = [copy.copy(r) for r in self.rules]
TypeError: 'Rule' object is not iterable
Any ideas?
Add a comma after your Rule object so that it is parsed as a valid tuple:
rules = (
    Rule(LinkExtractor(
        restrict_xpaths='//*[@id="products"][1]/p/a'),
        callback='parse'),
)
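The reason this works: in Python, parentheses alone only group an expression; it is the trailing comma that creates a tuple. A minimal standalone demonstration (plain Python, the variable names are mine):

no_comma = ('abc')     # just the string 'abc' -- the parentheses do nothing
with_comma = ('abc',)  # a one-element tuple

print(type(no_comma))    # <class 'str'>
print(type(with_comma))  # <class 'tuple'>

# This is exactly what CrawlSpider trips over: without the comma,
# self.rules is a single Rule object, and iterating over it in
# _compile_rules raises TypeError: 'Rule' object is not iterable.
for item in with_comma:
    print(item)  # abc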
You might also want to take a look at this answer: Why does adding a trailing comma after a variable name make it a tuple?
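If the trailing comma feels easy to forget, a list works just as well, since _compile_rules only needs self.rules to be iterable (see the `for r in self.rules` line in the traceback). A sketch of that variant, assuming the same link extractor as above:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule

# A one-element list is iterable with or without a trailing comma.
rules = [
    Rule(LinkExtractor(restrict_xpaths='//*[@id="products"][1]/p/a'),
         callback='parse'),
]

As a side note, the Scrapy documentation warns against using parse as a rule callback in a CrawlSpider, because CrawlSpider uses the parse method itself to implement its logic; renaming the callback (e.g. to parse_item) avoids that second pitfall, though it is unrelated to the TypeError here.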